In [186]:
!jupyter nbconvert  Project3_data_3.ipynb --to html
[NbConvertApp] Converting notebook Project3_data_3.ipynb to html
[NbConvertApp] Writing 9028375 bytes to Project3_data_3.html

PROJECT 3-DATA 3

Soufiane Fadel and Razieh Pourhasan

1. Intoduction

In this project, we are going to clean and process the following dataset.

  • data_2 : HR_2016_Census_simple

2.1 Load libraries

We import pandas to work with our data, Matplotlib to plot charts, and Seaborn to make our charts prettier.

In [109]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import matplotlib.pyplot as plt
import scipy.stats as ss
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
color = sns.color_palette()
sns.set(style="darkgrid")
import scipy.stats as ss
import matplotlib.pyplot as plt
from collections import Counter
from dython._private import convert, remove_incomplete_samples, replace_nan_with_value
from dython.nominal import associations

2.2 Read the Data

Let's load the "HR_2016_Census_simple" which has been provided in datasets for the course. We load the spread sheet:

In [110]:
xls_file = pd.ExcelFile('HR_2016_Census_simple.xlsx')
xls_file
Out[110]:
<pandas.io.excel._base.ExcelFile at 0x1235ce358>
In [111]:
xls_file.sheet_names
Out[111]:
['Data',
 'CensusMnemonics(2A Short form)',
 'CensusMnemonics(2A-L Long form)',
 'Mnemonics for additional var.',
 'LHINs']
In [112]:
rawdf = xls_file.parse('Data')
rawdf.head(10)
Out[112]:
Geocode Geo Name GNR POP_2016 POP_MIZ POP_MIZ_RATE POP_LRG_POP_CNTR LRG_POP_CNTR_RATE POP_MED_POP_CNTR MED_POP_CNTR_RATE POP_SML_POP_CNTRE SML_POP_CNTRE_RATE POP_RRL RRL_RATE POP_DENSE CENSUS_FAM_PRIV_HHLDS LONE LONE_RATE LONE_FEMALE LONE_FEMALE_RATE LONE_MALE LONE_MALE_RATE GNR.1 POP_PRIV_HHLD_2016 POP_25_29 HSG HSG_RATE HSG_ONLY HSG_ONLY_RATE HSG_PSG_BELOW_BACH HSG_PSG_BELOW_BACH_RATE HSG_PSG_BACH HSG_PSG_BACH_RATE HSG_PSG_ABOVE_BACH HSG_PSG_ABOVE_BACH_RATE POP_25_54 PSG PSG_RATE PSG_BELOW_BACH PSG_BELOW_BACH_RATE PSG_BACH PSG_BACH_RATE PSG_ABOVE_BACH PSG_ABOVE_BACH_RATE LABOUR UE UE_RATE ECON_FAM_LW_INC_DENOM LW_INC_ECON_FAM LW_INC_ECON_FAM_RATE UNATTACHED_LW_INC_DENOM LOW_INC_UNATTACHED LOW_INC_UNATTACHED_RATE POP_PRIV_HHLDS_LW_INC_DENOM POP_PRIV_HHLDS_LW_INC POP_PRIV_HHLDS_LW_INC_RATE CHILD17_ECON_FAM_LOW_INC_DENOM CHILD17_ECON_FAM_LOW_INC CHILD17_ECON_FAM_LOW_INC_RATE AVE_PERS_INC AVE_PERS_INC_MALE AVE_PERS_INC_FEMALE AVE_DWELL RENTER_SHELCO_HHLD RENTER_OVER30 RENTER_OVER30_RATE OWNER_SHELCO_HHLD OWNER_OVER30 OWNER_OVER30_RATE MEDIAN_HHLD_INC INC_SHR_BELOW_MEDIAN GOVT_TRNSFR_INC ABO ABO_RATE IMM IMM_RATE IMM_2006_2016 IMM_2006_2016_RATE_IMM IMM_2006_2016_RATE_TOTPOP VIS_MIN VISMIN_RATE POP_ONE_PLUS ONE_YR_INT_MIG ONE_YR_INT_MIG_RATE POP_FIVE_PLUS FIVE_YR_INT_MIG FIVE_YR_INT_MIG_RATE POP16 POP11 GROWTH POP16_2A POP_0_19 PCT_0_19 POP_65_PLUS PCT_65_PLUS MALE FEMALE MF_RATE POP_25_54.1 EMP_25_54 EMP_RATE_25_54 SHELCO_HH OVER30_HH HOU_AFF UNEMP_RATE
0 1 Canada 4.0 35151730 31226645 88.8 20938295 59.6 3179190 9.0 4458180 12.7 6576060 18.7 3.9 9840725 1612805 16.4 1262335 12.8 350465 3.6 5.1 34460065 2266030 2033665 89.7 529485 26.0 722825 35.5 581550 28.6 199815 9.8 14076325 9562680 67.9 5144695 53.8 2967120 31.0 1450860 15.2 18672470 814645 4.4 9573075 839765 8.8 5252515 1729655 32.9 33968190 4324825 12.7 6821825 961010 14.1 47487 56740 38632 443058 4452850 1775570 40.0 9368330 1550380 16.6 70332 20.5 11.7 1673780 4.9 7540830 21.9 2268170 30.1 6.6 7674580 22.3 34091785 1451000 4.3 32568565 4296715 13.2 35151728 33476688 5.0 35151730 7865725 22.4 5935630 16.9 17264200 17887530 1.0 14076325 11323575 80.4 13821180 3325950 40.0 7.7
1 10 Newfoundland and Labrador / Terre-Neuve-et-Lab... 4.0 519715 312530 60.1 178430 34.3 0 0.0 123290 23.7 218000 41.9 1.4 161005 24250 15.1 19290 12.0 4955 3.1 6.8 512250 28130 25455 90.5 6375 25.0 11515 45.2 5950 23.4 1615 6.3 203385 134670 66.2 91545 68.0 28820 21.4 14305 10.6 256855 24455 9.5 158975 9415 5.9 69330 19125 27.6 509340 45045 8.8 89210 10240 11.5 45210 56724 34259 243157 50415 19810 39.4 167335 17695 10.6 67320 24.2 16.5 45725 8.9 12075 2.4 5540 45.9 1.1 11815 2.3 507965 22185 4.4 489800 66305 13.5 519716 514536 1.0 519715 101695 19.6 101030 19.4 253925 265790 1.0 203385 146440 72.0 217750 37500 39.4 15.6
2 1011 1011 Eastern Regional Health Authority 3.7 313270 235790 75.3 178425 57.0 0 0.0 32520 10.4 102325 32.7 16.2 94510 15045 15.9 12025 12.7 3020 3.2 6.5 309015 19165 17785 92.8 4175 23.5 7470 42.0 4780 26.9 1360 7.6 127455 89625 70.3 56775 63.3 21620 24.1 11225 12.5 161620 11245 7.0 94065 5740 6.1 44770 13255 29.6 309015 29230 9.5 55415 6570 11.9 48517 60868 36888 285746 31990 13000 40.8 98985 11875 12.0 72211 20.3 13.8 8635 2.8 9225 3.0 4360 47.3 1.4 9430 3.1 306240 13165 4.3 294790 40710 13.8 313267 306517 2.2 313270 62495 19.9 56040 17.9 152620 160650 1.0 127455 96710 75.9 130975 24880 40.8 11.9
3 1012 1012 Central Regional Health Authority 4.1 92690 43275 46.7 0 0.0 0 0.0 31860 34.4 60835 65.6 2.2 30630 3570 11.7 2825 9.2 740 2.4 6.8 91160 3430 2895 84.4 905 31.3 1525 52.7 370 12.8 85 2.9 32955 18250 55.4 14490 79.4 2545 13.9 1220 6.7 41070 6080 14.8 30070 1705 5.7 10295 2535 24.6 90205 7165 7.9 14680 1660 11.3 37795 48390 27521 162012 7510 2985 39.9 31920 2625 8.2 56442 21.9 24.2 5270 5.8 855 0.9 295 34.5 0.3 780 0.9 90575 3995 4.4 88000 11715 13.3 92690 93642 -1.0 92685 16690 18.0 22100 23.8 45450 47240 1.0 32955 20900 63.4 39430 5610 39.9 23.1
4 1013 1013 Western Regional Health Authority 4.7 77685 33470 43.1 0 0.0 0 0.0 41830 53.8 35855 46.2 2.5 24655 4065 16.5 3275 13.3 790 3.2 8.1 76520 3315 2980 89.9 845 28.4 1520 51.0 495 16.6 120 4.0 27715 17280 62.3 12855 74.4 3130 18.1 1295 7.5 35455 4485 12.6 24275 1650 6.8 11025 2810 25.5 76525 7185 9.4 12240 1605 13.1 38028 46608 29950 195672 8065 3375 41.9 25995 2665 10.2 55950 21.5 22.9 19510 25.5 1040 1.4 360 34.6 0.5 685 0.9 75995 3820 5.0 73560 10170 13.8 77687 77983 -0.4 77685 13945 18.0 17840 23.0 37710 39975 0.9 27715 18205 65.7 34055 6040 41.9 21.2
5 1014 1014 Labrador-Grenfell Regional Health Authority 4.7 36070 0 0.0 0 0.0 0 0.0 17080 47.4 18995 52.7 0.1 11210 1570 14.0 1165 10.4 410 3.7 6.9 35550 2220 1795 80.9 440 24.5 1005 56.0 305 17.0 45 2.5 15260 9515 62.4 7425 78.0 1530 16.1 560 5.9 18710 2645 14.1 10565 320 3.0 3235 525 16.2 33600 1470 4.4 6865 400 5.8 51808 64955 38477 205671 2850 445 15.6 10435 525 5.0 94659 23.7 12.8 12305 34.6 955 2.7 520 54.5 1.5 920 2.6 35155 1210 3.4 33445 3700 11.1 36072 36394 -0.9 36070 8565 23.7 5045 14.0 18145 17925 1.0 15260 10635 69.7 13290 970 15.6 20.3
6 11 Prince Edward Island / Île-du-Prince-Édouard 4.1 142910 111240 77.8 0 0.0 44725 31.3 19665 13.8 78515 54.9 25.1 41700 6545 15.7 5155 12.4 1385 3.3 5.6 139685 7525 7030 93.4 2105 29.9 2450 34.9 1995 28.4 485 6.9 52555 34620 65.9 20590 59.5 9850 28.5 4180 12.1 77675 5930 7.6 41400 2480 6.0 20865 5755 27.6 139090 13175 9.5 27445 3065 11.2 38899 43979 34171 197966 17535 6335 36.3 40830 4510 11.1 61026 23.2 17.6 2740 2.0 8945 6.4 4695 52.5 3.4 6640 4.8 138370 7975 5.8 132700 20535 15.5 142907 140204 1.9 142910 31330 21.9 27710 19.4 69305 73605 0.9 52555 41790 79.5 58370 10850 36.3 12.3
7 1100 1100 Prince Edward Island / Île-du-Prince-Édouard 4.1 142905 111235 77.8 0 0.0 44725 31.3 19665 13.8 78515 54.9 25.1 41705 6545 15.7 5160 12.4 1390 3.3 5.6 139690 7520 7035 93.6 2110 30.0 2450 34.8 1995 28.4 480 6.8 52555 34620 65.9 20590 59.5 9850 28.5 4180 12.1 77675 5930 7.6 41400 2480 6.0 20865 5750 27.6 139090 13175 9.5 27450 3065 11.2 38899 43979 34171 197966 17540 6335 36.3 40830 4515 11.1 61026 23.2 17.6 2735 2.0 8940 6.4 4690 52.5 3.4 6645 4.8 138370 7975 5.8 132700 20530 15.5 142907 140204 1.9 142910 31335 21.9 27710 19.4 69305 73600 0.9 52555 41795 79.5 58370 10850 36.3 12.3
8 12 Nova Scotia / Nouvelle-Écosse 3.9 923600 680210 73.6 316690 34.3 0 0.0 213230 23.1 393680 42.6 17.4 270965 46900 17.3 37595 13.9 9305 3.4 5.5 908340 53305 49100 92.1 13980 28.5 17515 35.7 13815 28.1 3785 7.7 349435 234890 67.2 135785 57.8 68550 29.2 30555 13.0 474595 26770 5.6 267115 18820 7.0 156220 45090 28.9 898365 100145 11.1 159565 21790 13.7 41479 49413 34132 230441 123065 52535 42.8 273525 32780 12.0 60745 30.5 15.3 51490 5.7 55680 6.1 19870 35.7 2.2 58655 6.5 900455 33715 3.7 866425 94885 11.0 923598 921727 0.2 923600 185085 20.0 183820 19.9 446880 476715 0.9 349435 274335 78.5 396590 85315 42.8 10.0
9 1201 1201 Zone 1 - Western 4.1 194375 45945 23.6 0 0.0 0 0.0 56110 28.9 138265 71.1 11.0 60065 8885 14.8 6975 11.6 1915 3.2 6.5 190775 8425 7450 88.4 2480 33.3 3280 44.0 1355 18.2 335 4.5 67250 40740 60.6 28870 70.9 8225 20.2 3645 8.9 94220 5425 5.8 59555 3760 6.3 30640 7785 25.4 190080 18350 9.7 31850 4035 12.7 36861 44559 29585 196741 20160 8185 40.6 65535 7810 11.9 53832 22.6 19.5 14185 7.4 8840 4.6 2055 23.2 1.1 4690 2.5 189270 10185 5.4 182900 26655 14.6 194376 197425 -1.5 194375 36380 18.7 47105 24.2 94430 99950 0.9 67250 51765 77.0 85690 15995 40.6 10.4
In [113]:
print('The dataset HR_2016_Census_simple has {} rows and {} features'.format(rawdf.shape[0],rawdf.shape[1]))
The dataset HR_2016_Census_simple has 127 rows and 105 features
In [114]:
print('The list names of the features are : {}  '.format(rawdf.columns.values))
The list names of the features are : ['Geocode' 'Geo Name' 'GNR' 'POP_2016' 'POP_MIZ' 'POP_MIZ_RATE'
 'POP_LRG_POP_CNTR' 'LRG_POP_CNTR_RATE' 'POP_MED_POP_CNTR'
 'MED_POP_CNTR_RATE' 'POP_SML_POP_CNTRE' 'SML_POP_CNTRE_RATE' 'POP_RRL'
 'RRL_RATE' 'POP_DENSE' 'CENSUS_FAM_PRIV_HHLDS' 'LONE' 'LONE_RATE'
 'LONE_FEMALE' 'LONE_FEMALE_RATE' 'LONE_MALE' 'LONE_MALE_RATE' 'GNR.1'
 'POP_PRIV_HHLD_2016' 'POP_25_29' 'HSG' 'HSG_RATE' 'HSG_ONLY'
 'HSG_ONLY_RATE' 'HSG_PSG_BELOW_BACH' 'HSG_PSG_BELOW_BACH_RATE'
 'HSG_PSG_BACH' 'HSG_PSG_BACH_RATE' 'HSG_PSG_ABOVE_BACH'
 'HSG_PSG_ABOVE_BACH_RATE' 'POP_25_54' 'PSG' 'PSG_RATE' 'PSG_BELOW_BACH'
 'PSG_BELOW_BACH_RATE' 'PSG_BACH' 'PSG_BACH_RATE' 'PSG_ABOVE_BACH'
 'PSG_ABOVE_BACH_RATE' 'LABOUR' 'UE' 'UE_RATE' 'ECON_FAM_LW_INC_DENOM'
 'LW_INC_ECON_FAM' 'LW_INC_ECON_FAM_RATE' 'UNATTACHED_LW_INC_DENOM'
 'LOW_INC_UNATTACHED' 'LOW_INC_UNATTACHED_RATE'
 'POP_PRIV_HHLDS_LW_INC_DENOM' 'POP_PRIV_HHLDS_LW_INC'
 'POP_PRIV_HHLDS_LW_INC_RATE' 'CHILD17_ECON_FAM_LOW_INC_DENOM'
 'CHILD17_ECON_FAM_LOW_INC' 'CHILD17_ECON_FAM_LOW_INC_RATE' 'AVE_PERS_INC'
 'AVE_PERS_INC_MALE' 'AVE_PERS_INC_FEMALE' 'AVE_DWELL'
 'RENTER_SHELCO_HHLD' 'RENTER_OVER30' 'RENTER_OVER30_RATE'
 'OWNER_SHELCO_HHLD' 'OWNER_OVER30' 'OWNER_OVER30_RATE' 'MEDIAN_HHLD_INC'
 'INC_SHR_BELOW_MEDIAN' 'GOVT_TRNSFR_INC' 'ABO' 'ABO_RATE' 'IMM'
 'IMM_RATE' 'IMM_2006_2016' 'IMM_2006_2016_RATE_IMM'
 'IMM_2006_2016_RATE_TOTPOP' 'VIS_MIN' 'VISMIN_RATE' 'POP_ONE_PLUS'
 'ONE_YR_INT_MIG' 'ONE_YR_INT_MIG_RATE' 'POP_FIVE_PLUS' 'FIVE_YR_INT_MIG'
 'FIVE_YR_INT_MIG_RATE' 'POP16' 'POP11' 'GROWTH' 'POP16_2A' 'POP_0_19'
 'PCT_0_19' 'POP_65_PLUS' 'PCT_65_PLUS' 'MALE' 'FEMALE' 'MF_RATE'
 'POP_25_54.1' 'EMP_25_54' 'EMP_RATE_25_54' 'SHELCO_HH' 'OVER30_HH'
 'HOU_AFF' 'UNEMP_RATE']  
In [115]:
print('Information about the type features:')
rawdf.info() 
Information about the type features:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 127 entries, 0 to 126
Columns: 105 entries, Geocode to UNEMP_RATE
dtypes: float64(43), int64(61), object(1)
memory usage: 104.3+ KB
In [116]:
num_features = rawdf.select_dtypes(include=['int64','float64'])
categorical_features = rawdf.select_dtypes(include='object')
print('The dataset has {} numerical feature(s).  '.format(num_features.shape[1]))
print('The dataset has {} categorical feature(s). '.format(categorical_features.shape[1]))
The dataset has 104 numerical feature(s).  
The dataset has 1 categorical feature(s). 
In [117]:
rawdf.describe()
Out[117]:
Geocode GNR POP_2016 POP_MIZ POP_MIZ_RATE POP_LRG_POP_CNTR LRG_POP_CNTR_RATE POP_MED_POP_CNTR MED_POP_CNTR_RATE POP_SML_POP_CNTRE SML_POP_CNTRE_RATE POP_RRL RRL_RATE POP_DENSE CENSUS_FAM_PRIV_HHLDS LONE LONE_RATE LONE_FEMALE LONE_FEMALE_RATE LONE_MALE LONE_MALE_RATE GNR.1 POP_PRIV_HHLD_2016 POP_25_29 HSG HSG_RATE HSG_ONLY HSG_ONLY_RATE HSG_PSG_BELOW_BACH HSG_PSG_BELOW_BACH_RATE HSG_PSG_BACH HSG_PSG_BACH_RATE HSG_PSG_ABOVE_BACH HSG_PSG_ABOVE_BACH_RATE POP_25_54 PSG PSG_RATE PSG_BELOW_BACH PSG_BELOW_BACH_RATE PSG_BACH PSG_BACH_RATE PSG_ABOVE_BACH PSG_ABOVE_BACH_RATE LABOUR UE UE_RATE ECON_FAM_LW_INC_DENOM LW_INC_ECON_FAM LW_INC_ECON_FAM_RATE UNATTACHED_LW_INC_DENOM LOW_INC_UNATTACHED LOW_INC_UNATTACHED_RATE POP_PRIV_HHLDS_LW_INC_DENOM POP_PRIV_HHLDS_LW_INC POP_PRIV_HHLDS_LW_INC_RATE CHILD17_ECON_FAM_LOW_INC_DENOM CHILD17_ECON_FAM_LOW_INC CHILD17_ECON_FAM_LOW_INC_RATE AVE_PERS_INC AVE_PERS_INC_MALE AVE_PERS_INC_FEMALE AVE_DWELL RENTER_SHELCO_HHLD RENTER_OVER30 RENTER_OVER30_RATE OWNER_SHELCO_HHLD OWNER_OVER30 OWNER_OVER30_RATE MEDIAN_HHLD_INC INC_SHR_BELOW_MEDIAN GOVT_TRNSFR_INC ABO ABO_RATE IMM IMM_RATE IMM_2006_2016 IMM_2006_2016_RATE_IMM IMM_2006_2016_RATE_TOTPOP VIS_MIN VISMIN_RATE POP_ONE_PLUS ONE_YR_INT_MIG ONE_YR_INT_MIG_RATE POP_FIVE_PLUS FIVE_YR_INT_MIG FIVE_YR_INT_MIG_RATE POP16 POP11 GROWTH POP16_2A POP_0_19 PCT_0_19 POP_65_PLUS PCT_65_PLUS MALE FEMALE MF_RATE POP_25_54.1 EMP_25_54 EMP_RATE_25_54 SHELCO_HH OVER30_HH HOU_AFF UNEMP_RATE
count 127.000000 127.000000 1.270000e+02 1.270000e+02 127.000000 1.270000e+02 127.000000 1.270000e+02 127.000000 1.270000e+02 127.000000 1.270000e+02 127.000000 127.00000 1.270000e+02 1.270000e+02 127.000000 1.270000e+02 127.000000 127.000000 127.000000 127.000000 1.270000e+02 1.270000e+02 1.270000e+02 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 1.270000e+02 1.270000e+02 127.000000 1.270000e+02 127.000000 1.270000e+02 127.000000 1.270000e+02 127.000000 1.270000e+02 127.000000 127.000000 1.270000e+02 127.000000 120.000000 1.270000e+02 1.270000e+02 120.000000 1.270000e+02 1.270000e+02 120.000000 1.270000e+02 127.000000 120.000000 127.000000 127.000000 127.000000 1.270000e+02 1.270000e+02 1.270000e+02 127.000000 1.270000e+02 1.270000e+02 127.000000 127.000000 127.000000 127.000000 1.270000e+02 127.000000 1.270000e+02 127.000000 1.270000e+02 122.000000 127.000000 1.270000e+02 127.000000 1.270000e+02 1.270000e+02 127.000000 1.270000e+02 1.270000e+02 127.000000 1.270000e+02 1.270000e+02 127.000000 1.270000e+02 1.270000e+02 127.000000 1.270000e+02 127.000000 1.270000e+02 1.270000e+02 127.000000 1.270000e+02 1.270000e+02 127.000000 1.270000e+02 1.270000e+02 127.000000 127.000000
mean 3283.417323 4.827559 8.306347e+05 7.376373e+05 68.637008 4.946054e+05 26.453543 7.509898e+04 11.937008 1.053928e+05 25.014961 1.555379e+05 36.598425 196.35748 2.325308e+05 3.812839e+04 17.221260 2.984189e+04 13.165354 8286.692913 4.056693 6.077953 8.142946e+05 5.355106e+04 4.805138e+04 84.861417 12514.409449 31.836220 17078.503937 40.233858 13738.543307 21.855118 4720.314961 6.018898 3.326120e+05 2.259253e+05 61.746457 1.215544e+05 64.455906 7.009661e+04 25.166929 3.427472e+04 10.404724 4.411766e+05 19257.598425 5.577953 2.261687e+05 19841.456693 6.954167 1.240878e+05 4.086134e+04 27.624167 8.025269e+05 1.021810e+05 10.292500 1.611869e+05 22710.629921 12.096667 44704.582677 52781.559055 36830.984252 3.337558e+05 1.052025e+05 4.194740e+04 37.062205 2.213245e+05 3.662535e+04 13.206299 70044.692913 22.247244 14.324409 3.977949e+04 14.366142 1.781072e+05 11.667717 5.356610e+04 32.571311 3.527559 1.812918e+05 10.285039 8.055893e+05 3.428528e+04 4.681102 7.695840e+05 1.015211e+05 14.045669 8.306349e+05 7.910610e+05 3.181890 8.306350e+05 1.859169e+05 23.559843 1.402306e+05 17.446457 4.079556e+05 4.226793e+05 0.985039 3.326120e+05 2.675346e+05 77.916535 3.265270e+05 7.857319e+04 37.062205 9.597638
std 1824.155993 1.512890 3.413698e+06 3.064014e+06 32.161415 2.090969e+06 35.747998 3.091958e+05 18.712265 4.224387e+05 19.268862 6.186895e+05 21.542774 710.76828 9.550946e+05 1.573973e+05 6.033447 1.234862e+05 4.517995 33980.028998 1.690195 1.824564 3.347676e+06 2.205613e+05 1.983494e+05 11.377366 51049.048576 9.523177 70168.318020 8.934144 57596.030552 6.998863 20190.460970 3.355184 1.369017e+06 9.335191e+05 9.364012 4.982754e+05 11.158075 2.925341e+05 6.825432 1.447728e+05 4.686215 1.813842e+06 78831.685576 3.322265 9.299445e+05 83545.768594 3.122886 5.099972e+05 1.704399e+05 6.672910 3.305772e+06 4.283684e+05 3.538033 6.636664e+05 95227.745705 4.420748 7154.292111 9781.193787 5636.364186 2.001311e+05 4.369642e+05 1.753055e+05 9.826034 9.071526e+05 1.536515e+05 5.109309 13465.638646 1.905665 4.502657 1.556458e+05 22.616665 7.708174e+05 11.537963 2.254646e+05 16.217155 3.580340 7.868109e+05 13.602212 3.312040e+06 1.401145e+05 1.231191 3.164610e+06 4.148839e+05 3.643101 3.413697e+06 3.252050e+06 3.915768 3.413698e+06 7.631765e+05 5.500408 5.766852e+05 5.002139 1.675501e+06 1.738222e+06 0.041935 1.369017e+06 1.101749e+06 7.439532 1.341838e+06 3.283070e+05 9.826034 5.072372
min 1.000000 2.400000 2.630000e+03 0.000000e+00 0.000000 0.000000e+00 0.000000 0.000000e+00 0.000000 0.000000e+00 0.000000 1.300000e+02 0.000000 0.00000 6.700000e+02 2.200000e+02 9.100000 1.550000e+02 6.800000 70.000000 2.100000 3.000000 2.615000e+03 1.950000e+02 7.000000e+01 35.900000 45.000000 13.000000 20.000000 23.000000 0.000000 0.000000 0.000000 0.000000 1.000000e+03 2.450000e+02 24.500000 1.850000e+02 33.100000 5.500000e+01 11.800000 1.000000e+01 3.900000 1.060000e+03 200.000000 2.000000 0.000000e+00 0.000000 0.000000 0.000000e+00 0.000000e+00 12.800000 0.000000e+00 0.000000e+00 4.400000 0.000000e+00 0.000000 3.600000 28882.000000 32813.000000 24812.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000 47680.000000 16.400000 5.500000 8.100000e+02 0.500000 0.000000e+00 0.000000 0.000000e+00 6.400000 0.000000 0.000000e+00 0.000000 2.570000e+03 1.100000e+02 1.900000 2.385000e+03 2.100000e+02 5.400000 2.632000e+03 2.297000e+03 -6.200000 2.635000e+03 1.095000e+03 15.900000 1.350000e+02 3.800000 1.330000e+03 1.305000e+03 0.900000 1.000000e+03 4.700000e+02 45.400000 0.000000e+00 0.000000e+00 0.000000 4.000000
25% 2402.500000 3.900000 7.662500e+04 3.786750e+04 50.000000 0.000000e+00 0.000000 0.000000e+00 0.000000 1.615250e+04 9.150000 2.767000e+04 18.650000 2.10000 2.250250e+04 3.997500e+03 14.350000 3.047500e+03 10.850000 972.500000 3.200000 4.750000 7.518000e+04 4.377500e+03 3.445000e+03 82.850000 1132.500000 25.650000 1502.500000 33.650000 635.000000 17.000000 122.500000 3.800000 2.788500e+04 1.554250e+04 57.150000 1.133750e+04 57.550000 3.360000e+03 20.650000 1.262500e+03 6.550000 3.806750e+04 2317.500000 3.700000 2.232000e+04 1110.000000 5.250000 1.022000e+04 2.335000e+03 23.050000 7.470000e+04 5.787500e+03 8.000000 1.396750e+04 1290.000000 9.300000 40592.500000 47741.000000 33632.500000 2.091425e+05 7.467500e+03 2.457500e+03 32.450000 2.193000e+04 2.490000e+03 10.250000 61559.500000 21.450000 11.100000 5.942500e+03 2.700000 3.725000e+03 4.600000 1.097500e+03 18.675000 0.900000 2.582500e+03 2.500000 7.428000e+04 3.605000e+03 3.600000 7.126500e+04 9.585000e+03 10.950000 7.662550e+04 7.555050e+04 0.550000 7.662500e+04 1.816000e+04 20.250000 1.418750e+04 15.150000 3.770000e+04 3.893000e+04 1.000000 2.788500e+04 2.089500e+04 77.150000 3.113000e+04 4.992500e+03 32.450000 6.700000
50% 3544.000000 4.500000 1.705350e+05 1.342050e+05 78.700000 0.000000e+00 0.000000 0.000000e+00 0.000000 3.076000e+04 22.500000 4.966500e+04 37.700000 10.00000 5.105500e+04 7.645000e+03 15.800000 6.050000e+03 12.300000 1730.000000 3.600000 5.700000 1.671800e+05 9.060000e+03 7.800000e+03 88.400000 2480.000000 32.200000 3250.000000 38.700000 1595.000000 19.800000 410.000000 4.800000 5.899000e+04 3.697000e+04 62.400000 2.453500e+04 68.000000 8.840000e+03 23.400000 3.450000e+03 9.000000 8.574000e+04 4485.000000 4.400000 5.004000e+04 2880.000000 6.350000 2.428000e+04 6.990000e+03 27.600000 1.671800e+05 1.559500e+04 9.800000 3.185000e+04 3490.000000 11.700000 43553.000000 50886.000000 35650.000000 2.929320e+05 1.754000e+04 7.065000e+03 38.800000 5.244500e+04 6.735000e+03 12.100000 67548.000000 22.200000 14.300000 1.225000e+04 5.700000 1.127000e+04 8.400000 2.660000e+03 29.100000 2.000000 6.340000e+03 4.500000 1.655050e+05 7.995000e+03 4.800000 1.587050e+05 2.328500e+04 14.400000 1.705370e+05 1.724010e+05 2.900000 1.705350e+05 3.638000e+04 22.400000 3.641000e+04 17.900000 8.250000e+04 8.808000e+04 1.000000 5.899000e+04 4.779500e+04 79.500000 6.998000e+04 1.471500e+04 38.800000 7.900000
75% 4706.500000 5.250000 4.670850e+05 4.216925e+05 95.300000 2.508925e+05 56.850000 5.435750e+04 17.250000 6.370750e+04 40.300000 9.512500e+04 51.900000 38.60000 1.312375e+05 2.161500e+04 17.650000 1.651250e+04 13.900000 4827.500000 4.200000 6.800000 4.539225e+05 2.759250e+04 2.476250e+04 91.200000 6777.500000 36.800000 10637.500000 45.700000 5615.000000 26.500000 1732.500000 7.650000 1.797750e+05 1.234550e+05 67.850000 7.609750e+04 72.750000 3.113500e+04 29.450000 1.523750e+04 13.250000 2.463100e+05 10292.500000 5.850000 1.307375e+05 9412.500000 7.625000 6.627750e+04 2.020500e+04 32.100000 4.538450e+05 4.902250e+04 11.825000 9.013000e+04 10207.500000 13.750000 47196.500000 56732.000000 38545.500000 3.679680e+05 5.043000e+04 2.074000e+04 43.450000 1.325475e+05 1.839250e+04 15.100000 74352.500000 23.200000 16.550000 2.395250e+04 13.550000 5.723750e+04 14.000000 1.992000e+04 42.550000 5.100000 4.840000e+04 10.900000 4.492150e+05 2.112500e+04 5.600000 4.295750e+05 6.526500e+04 16.950000 4.670840e+05 4.485595e+05 5.550000 4.670850e+05 1.023475e+05 24.600000 8.449250e+04 21.000000 2.330050e+05 2.362075e+05 1.000000 1.797750e+05 1.460575e+05 82.100000 1.903975e+05 4.281750e+04 43.450000 10.400000
max 6201.000000 12.800000 3.515173e+07 3.122664e+07 100.000000 2.093830e+07 100.000000 3.179190e+06 72.800000 4.458180e+06 81.400000 6.576060e+06 100.200000 4942.60000 9.840725e+06 1.612805e+06 51.100000 1.262335e+06 40.800000 350465.000000 11.300000 13.400000 3.446006e+07 2.266030e+06 2.033665e+06 96.800000 529485.000000 64.300000 722825.000000 64.000000 581550.000000 42.900000 199815.000000 19.100000 1.407632e+07 9.562680e+06 79.700000 5.144695e+06 84.300000 2.967120e+06 45.000000 1.450860e+06 24.200000 1.867247e+07 814645.000000 18.900000 9.573075e+06 839765.000000 20.800000 5.252515e+06 1.729655e+06 46.200000 3.396819e+07 4.324825e+06 23.500000 6.821825e+06 961010.000000 27.400000 68678.000000 87675.000000 58918.000000 1.410609e+06 4.452850e+06 1.775570e+06 52.300000 9.368330e+06 1.550380e+06 32.000000 117723.000000 30.500000 28.300000 1.673780e+06 95.700000 7.540830e+06 60.200000 2.268170e+06 77.900000 16.700000 7.674580e+06 76.300000 3.409178e+07 1.451000e+06 7.200000 3.256856e+07 4.296715e+06 21.700000 3.515173e+07 3.347669e+07 14.600000 3.515173e+07 7.865725e+06 43.400000 5.935630e+06 26.400000 1.726420e+07 1.788753e+07 1.100000 1.407632e+07 1.132358e+07 88.600000 1.382118e+07 3.325950e+06 52.300000 37.300000

Quick check to see if there is any missing values at all:

In [118]:
print(rawdf.isnull().values.any())
True

The answer is Yes, then let's see which columns have missing values:

In [119]:
# checking missing data 
total = rawdf.isnull().sum().sort_values(ascending = False)
percent = 100*(rawdf.isnull().sum()/rawdf.isnull().count()).sort_values(ascending = False)
missing_data  = pd.concat([total, percent, percent.cumsum()], axis=1, keys=['Total missing ', 'Percent', 'Cumulative Percent'])
missing_data.head(10)
Out[119]:
Total missing Percent Cumulative Percent
LOW_INC_UNATTACHED_RATE 7 5.511811 5.511811
CHILD17_ECON_FAM_LOW_INC_RATE 7 5.511811 11.023622
POP_PRIV_HHLDS_LW_INC_RATE 7 5.511811 16.535433
LW_INC_ECON_FAM_RATE 7 5.511811 22.047244
IMM_2006_2016_RATE_IMM 5 3.937008 25.984252
PSG_BELOW_BACH_RATE 0 0.000000 25.984252
HSG_PSG_BELOW_BACH 0 0.000000 25.984252
HSG_PSG_BELOW_BACH_RATE 0 0.000000 25.984252
HSG_PSG_BACH 0 0.000000 25.984252
HSG_PSG_BACH_RATE 0 0.000000 25.984252

Let's see which rows have missing data:

In [120]:
df_null = rawdf.loc[:, rawdf.isnull().any()]
df_null[df_null.isnull().any(axis=1)]
len(df_null[df_null.isnull().any(axis=1)])
df_null.isnull().sum()
Out[120]:
LW_INC_ECON_FAM_RATE             7
LOW_INC_UNATTACHED_RATE          7
POP_PRIV_HHLDS_LW_INC_RATE       7
CHILD17_ECON_FAM_LOW_INC_RATE    7
IMM_2006_2016_RATE_IMM           5
dtype: int64
In [121]:
rawdf.loc[121:126,'Geo Name']
Out[121]:
121                                               Yukon 
122                                          6001 Yukon 
123    Northwest Territories / Territoires du Nord-Ou...
124    6101 Northwest Territories / Territoires du No...
125                                             Nunavut 
126                                        6201 Nunavut 
Name: Geo Name, dtype: object

Last rows in dataset contains information for territories. Low income status does not apply to territories. We can set null values for those to zero. Also, all columns with missing values are proportions. Let's check the values in related population columns:

In [122]:
rawdf.columns[rawdf.isnull().any()]
Out[122]:
Index(['LW_INC_ECON_FAM_RATE', 'LOW_INC_UNATTACHED_RATE', 'POP_PRIV_HHLDS_LW_INC_RATE', 'CHILD17_ECON_FAM_LOW_INC_RATE', 'IMM_2006_2016_RATE_IMM'], dtype='object')
In [123]:
rawdf.loc[{39,81,94,96,97}, ['LW_INC_ECON_FAM','LOW_INC_UNATTACHED','POP_PRIV_HHLDS_LW_INC','CHILD17_ECON_FAM_LOW_INC','IMM_2006_2016'] ]
Out[123]:
LW_INC_ECON_FAM LOW_INC_UNATTACHED POP_PRIV_HHLDS_LW_INC CHILD17_ECON_FAM_LOW_INC IMM_2006_2016
96 0 10 20 10 0
97 575 420 2525 1170 0
39 0 0 0 0 0
81 680 795 3295 1400 0
94 330 195 1540 785 0

It is obvious that the related populations have been recorded as zero, therefore their corresponding rate should be set to zero too:

In [124]:
rawdf = rawdf.fillna(0)
print(rawdf.isnull().values.any())
False

Let's first take the first row which is the data for Canada and investigate if there are duplicated columns: columns that contain the same information but they are labeled differently.

In [125]:
df1 = rawdf.iloc[0]
df1_num = df1.drop(df1.index[1])
df1_num[df1_num.duplicated(keep=False)]
Out[125]:
Geocode                              1
POP_2016                      35151730
SML_POP_CNTRE_RATE                12.7
POP_25_54                     14076325
POP_PRIV_HHLDS_LW_INC_RATE        12.7
RENTER_OVER30_RATE                  40
POP16_2A                      35151730
MF_RATE                              1
POP_25_54.1                   14076325
HOU_AFF                             40
Name: 0, dtype: object

There are two columns POP_2016 and POP16_2A for 2016 population and two columns POP_25_54 and POP_25_54.l for Population aged 25 to 54. We check the whole dataframe to compare all rows:

In [126]:
rawdf[{'POP_2016','POP16_2A'}], rawdf[{'POP_25_54', 'POP_25_54.1'}]
Out[126]:
(     POP16_2A  POP_2016
 0    35151730  35151730
 1      519715    519715
 2      313270    313270
 3       92685     92690
 4       77685     77685
 5       36070     36070
 6      142910    142910
 7      142910    142905
 8      923600    923600
 9      194375    194375
 10     146250    146245
 11     158935    158935
 12     424035    424040
 13     747100    747100
 14     209255    209255
 15     170535    170535
 16     174345    174350
 17      47775     47775
 18      25250     25250
 19      76375     76375
 20      43565     43560
 21    8164360   8164360
 22     197385    197385
 23     276365    276370
 24     729995    729995
 25     508510    508515
 26     472615    472615
 27    1942045   1942045
 28     382605    382605
 29     146715    146715
 30      92520     92515
 31      14235     14230
 32      90310     90315
 33     420085    420080
 34     422995    422995
 35     494795    494795
 36     589400    589400
 37    1353460   1353455
 38      13190     13185
 39      17140     17145
 40   13448495  13448490
 41     113090    113085
 42     134945    134940
 43     645865    645865
 44      88980     88975
 45     161975    161975
 46     109650    109650
 47     179080    179085
 48     548430    548430
 49     536915    536920
 50     161180    161180
 51      59295     59300
 52     102045    102045
 53     193365    193365
 54     126640    126640
 55     169245    169245
 56     455525    455525
 57     447885    447885
 58     123820    123820
 59      76455     76450
 60     934245    934245
 61     110865    110865
 62    1381745   1381745
 63      76795     76800
 64     138235    138235
 65      84200     84195
 66     103590    103590
 67     202760    202760
 68     540250    540245
 69     196450    196445
 70     151885    151890
 71      33045     33050
 72     535155    535155
 73     284460    284460
 74     398950    398955
 75    1109910   1109910
 76    2731570   2731570
 77    1278365   1278365
 78     720885    720885
 79     165600    165600
 80     127610    127610
 81      72220     72220
 82     192050    192050
 83    1098350   1098350
 84      56220     56220
 85      53615     53615
 86      42970     42965
 87     278270    278270
 88      55110     55110
 89     344515    344515
 90      41885     41880
 91      39565     39565
 92      75605     75605
 93      75145     75145
 94      22280     22280
 95      10540     10540
 96       2635      2630
 97      35455     35455
 98    4067175   4067175
 99     291110    291110
 100   1551875   1551875
 101    461555    461555
 102   1320795   1320800
 103    441835    441835
 104   4648055   4648055
 105     79855     79855
 106     78465     78465
 107    362255    362260
 108    219465    219470
 109    295755    295755
 110    639245    639245
 111    784980    784975
 112    198310    198310
 113    649030    649025
 114    284400    284400
 115    383360    383360
 116    270815    270815
 117    122230    122235
 118     71550     71545
 119    140450    140450
 120     67890     67885
 121     35875     35875
 122     35875     35875
 123     41785     41785
 124     41785     41785
 125     35945     35940
 126     35945     35945,      POP_25_54  POP_25_54.1
 0     14076325     14076325
 1       203385       203385
 2       127455       127455
 3        32955        32955
 4        27715        27715
 5        15260        15260
 6        52555        52555
 7        52555        52555
 8       349435       349435
 9        67250        67250
 10       52130        52130
 11       53135        53135
 12      176925       176925
 13      282785       282785
 14       82135        82135
 15       64450        64450
 16       67615        67615
 17       17070        17070
 18        8515         8515
 19       27420        27420
 20       15580        15580
 21     3185860      3185860
 22       66195        66195
 23       98055        98055
 24      281500       281500
 25      178965       178965
 26      168620       168620
 27      833720       833720
 28      154755       154755
 29       54705        54705
 30       35410        35410
 31        5540         5540
 32       30045        30045
 33      153885       153885
 34      167110       167110
 35      191090       191090
 36      227100       227100
 37      527695       527695
 38        4830         4830
 39        6650         6650
 40     5410555      5410555
 41       38695        38695
 42       51545        51545
 43      263235       263235
 44       32595        32595
 45       53480        53480
 46       37660        37660
 47       56925        56925
 48      224230       224230
 49      212345       212345
 50       56375        56375
 51       19135        19135
 52       35650        35650
 53       71655        71655
 54       44230        44230
 55       58990        58990
 56      179930       179930
 57      162205       162205
 58       43815        43815
 59       27445        27445
 60      383885       383885
 61       41555        41555
 62      582635       582635
 63       27780        27780
 64       47765        47765
 65       31980        31980
 66       37665        37665
 67       74395        74395
 68      205155       205155
 69       74695        74695
 70       55995        55995
 71       11490        11490
 72      221450       221450
 73      113055       113055
 74      151285       151285
 75      457835       457835
 76     1221810      1221810
 77      492790       492790
 78      295675       295675
 79       58130        58130
 80       44110        44110
 81       26305        26305
 82       68575        68575
 83      424180       424180
 84       21615        21615
 85       19170        19170
 86       13905        13905
 87      114335       114335
 88       19035        19035
 89      141525       141525
 90       14270        14270
 91       13205        13205
 92       26075        26075
 93       28280        28280
 94        7835         7835
 95        3920         3920
 96        1000         1000
 97       12755        12755
 98     1763210      1763210
 99      107060       107060
 100     703130       703130
 101     179620       179620
 102     582385       582385
 103     191015       191015
 104    1863055      1863055
 105      29010        29010
 106      27550        27550
 107     125845       125845
 108      78215        78215
 109     110610       110610
 110     278515       278515
 111     317950       317950
 112      81970        81970
 113     313755       313755
 114     110195       110195
 115     145695       145695
 116      89690        89690
 117      41740        41740
 118      27990        27990
 119      55250        55250
 120      29065        29065
 121      15605        15605
 122      15605        15605
 123      18760        18760
 124      18760        18760
 125      14140        14140
 126      14140        14140)

The are essentially the same. There are two more columns with the same values but their names are not similar. Is the similarity in values for Canada a coincidence? Let's check them too:

In [127]:
rawdf[{'RENTER_OVER30_RATE','HOU_AFF'}]
rawdf[{'SML_POP_CNTRE_RATE','POP_PRIV_HHLDS_LW_INC_RATE'}]
Out[127]:
POP_PRIV_HHLDS_LW_INC_RATE SML_POP_CNTRE_RATE
0 12.7 12.7
1 8.8 23.7
2 9.5 10.4
3 7.9 34.4
4 9.4 53.8
5 4.4 47.4
6 9.5 13.8
7 9.5 13.8
8 11.1 23.1
9 9.7 28.9
10 9.9 43.4
11 10.3 50.8
12 12.6 3.1
13 10.4 18.7
14 10.3 8.2
15 11.7 23.0
16 10.6 10.0
17 8.9 41.0
18 11.6 44.1
19 9.1 31.4
20 8.1 26.0
21 13.0 12.7
22 8.2 31.0
23 8.6 26.7
24 11.4 7.9
25 11.5 13.5
26 10.5 19.9
27 23.5 0.0
28 11.9 9.6
29 7.5 53.4
30 5.6 67.4
31 4.9 81.4
32 7.1 23.6
33 6.2 28.8
34 12.1 0.0
35 8.9 18.5
36 9.3 16.9
37 9.3 10.7
38 6.4 36.2
39 0.0 44.7
40 13.7 10.0
41 11.9 14.6
42 10.0 11.5
43 8.9 6.0
44 8.9 15.3
45 8.0 45.9
46 5.9 45.3
47 7.3 40.6
48 8.0 1.7
49 16.6 2.0
50 10.4 10.0
51 7.1 40.1
52 11.5 27.0
53 10.8 4.5
54 9.6 15.2
55 7.5 40.2
56 14.7 6.2
57 11.3 16.5
58 10.8 13.1
59 7.2 40.4
60 13.8 1.8
61 6.9 33.8
62 14.2 2.4
63 6.8 26.9
64 12.0 7.3
65 8.3 67.2
66 7.0 53.3
67 9.1 29.7
68 8.0 36.6
69 9.8 24.3
70 10.9 5.3
71 9.9 50.0
72 10.2 6.4
73 7.6 18.0
74 13.3 7.3
75 12.7 5.1
76 22.8 0.0
77 13.1 13.7
78 16.7 0.0
79 9.6 20.7
80 6.0 24.1
81 9.8 39.1
82 7.0 42.5
83 9.1 17.7
84 5.0 45.5
85 8.6 4.5
86 6.6 46.1
87 9.5 6.3
88 7.5 49.0
89 9.9 11.1
90 5.1 33.9
91 7.7 35.9
92 11.9 3.1
93 8.8 30.0
94 16.6 46.4
95 14.4 0.0
96 5.7 0.0
97 15.5 29.1
98 9.3 14.8
99 8.4 22.5
100 10.1 9.8
101 6.7 34.3
102 10.7 8.5
103 5.8 26.1
104 14.4 11.7
105 8.1 56.6
106 11.0 53.2
107 10.7 13.0
108 10.1 24.9
109 10.8 5.2
110 17.3 0.0
111 13.6 7.1
112 22.9 0.0
113 21.5 0.0
114 13.1 22.5
115 12.2 2.4
116 12.0 38.0
117 11.8 8.7
118 8.8 56.5
119 9.9 14.8
120 6.5 54.2
121 0.0 60.6
122 0.0 60.6
123 0.0 64.1
124 0.0 64.1
125 0.0 48.9
126 0.0 49.0

The first two columns are essentially the same, but the last two are different. We drop duplicated columns:

In [128]:
df = rawdf.drop(['POP16_2A','POP_25_54.1', 'HOU_AFF'], axis = 1)
df.head()
Out[128]:
Geocode Geo Name GNR POP_2016 POP_MIZ POP_MIZ_RATE POP_LRG_POP_CNTR LRG_POP_CNTR_RATE POP_MED_POP_CNTR MED_POP_CNTR_RATE POP_SML_POP_CNTRE SML_POP_CNTRE_RATE POP_RRL RRL_RATE POP_DENSE CENSUS_FAM_PRIV_HHLDS LONE LONE_RATE LONE_FEMALE LONE_FEMALE_RATE LONE_MALE LONE_MALE_RATE GNR.1 POP_PRIV_HHLD_2016 POP_25_29 HSG HSG_RATE HSG_ONLY HSG_ONLY_RATE HSG_PSG_BELOW_BACH HSG_PSG_BELOW_BACH_RATE HSG_PSG_BACH HSG_PSG_BACH_RATE HSG_PSG_ABOVE_BACH HSG_PSG_ABOVE_BACH_RATE POP_25_54 PSG PSG_RATE PSG_BELOW_BACH PSG_BELOW_BACH_RATE PSG_BACH PSG_BACH_RATE PSG_ABOVE_BACH PSG_ABOVE_BACH_RATE LABOUR UE UE_RATE ECON_FAM_LW_INC_DENOM LW_INC_ECON_FAM LW_INC_ECON_FAM_RATE UNATTACHED_LW_INC_DENOM LOW_INC_UNATTACHED LOW_INC_UNATTACHED_RATE POP_PRIV_HHLDS_LW_INC_DENOM POP_PRIV_HHLDS_LW_INC POP_PRIV_HHLDS_LW_INC_RATE CHILD17_ECON_FAM_LOW_INC_DENOM CHILD17_ECON_FAM_LOW_INC CHILD17_ECON_FAM_LOW_INC_RATE AVE_PERS_INC AVE_PERS_INC_MALE AVE_PERS_INC_FEMALE AVE_DWELL RENTER_SHELCO_HHLD RENTER_OVER30 RENTER_OVER30_RATE OWNER_SHELCO_HHLD OWNER_OVER30 OWNER_OVER30_RATE MEDIAN_HHLD_INC INC_SHR_BELOW_MEDIAN GOVT_TRNSFR_INC ABO ABO_RATE IMM IMM_RATE IMM_2006_2016 IMM_2006_2016_RATE_IMM IMM_2006_2016_RATE_TOTPOP VIS_MIN VISMIN_RATE POP_ONE_PLUS ONE_YR_INT_MIG ONE_YR_INT_MIG_RATE POP_FIVE_PLUS FIVE_YR_INT_MIG FIVE_YR_INT_MIG_RATE POP16 POP11 GROWTH POP_0_19 PCT_0_19 POP_65_PLUS PCT_65_PLUS MALE FEMALE MF_RATE EMP_25_54 EMP_RATE_25_54 SHELCO_HH OVER30_HH UNEMP_RATE
0 1 Canada 4.0 35151730 31226645 88.8 20938295 59.6 3179190 9.0 4458180 12.7 6576060 18.7 3.9 9840725 1612805 16.4 1262335 12.8 350465 3.6 5.1 34460065 2266030 2033665 89.7 529485 26.0 722825 35.5 581550 28.6 199815 9.8 14076325 9562680 67.9 5144695 53.8 2967120 31.0 1450860 15.2 18672470 814645 4.4 9573075 839765 8.8 5252515 1729655 32.9 33968190 4324825 12.7 6821825 961010 14.1 47487 56740 38632 443058 4452850 1775570 40.0 9368330 1550380 16.6 70332 20.5 11.7 1673780 4.9 7540830 21.9 2268170 30.1 6.6 7674580 22.3 34091785 1451000 4.3 32568565 4296715 13.2 35151728 33476688 5.0 7865725 22.4 5935630 16.9 17264200 17887530 1.0 11323575 80.4 13821180 3325950 7.7
1 10 Newfoundland and Labrador / Terre-Neuve-et-Lab... 4.0 519715 312530 60.1 178430 34.3 0 0.0 123290 23.7 218000 41.9 1.4 161005 24250 15.1 19290 12.0 4955 3.1 6.8 512250 28130 25455 90.5 6375 25.0 11515 45.2 5950 23.4 1615 6.3 203385 134670 66.2 91545 68.0 28820 21.4 14305 10.6 256855 24455 9.5 158975 9415 5.9 69330 19125 27.6 509340 45045 8.8 89210 10240 11.5 45210 56724 34259 243157 50415 19810 39.4 167335 17695 10.6 67320 24.2 16.5 45725 8.9 12075 2.4 5540 45.9 1.1 11815 2.3 507965 22185 4.4 489800 66305 13.5 519716 514536 1.0 101695 19.6 101030 19.4 253925 265790 1.0 146440 72.0 217750 37500 15.6
2 1011 1011 Eastern Regional Health Authority 3.7 313270 235790 75.3 178425 57.0 0 0.0 32520 10.4 102325 32.7 16.2 94510 15045 15.9 12025 12.7 3020 3.2 6.5 309015 19165 17785 92.8 4175 23.5 7470 42.0 4780 26.9 1360 7.6 127455 89625 70.3 56775 63.3 21620 24.1 11225 12.5 161620 11245 7.0 94065 5740 6.1 44770 13255 29.6 309015 29230 9.5 55415 6570 11.9 48517 60868 36888 285746 31990 13000 40.8 98985 11875 12.0 72211 20.3 13.8 8635 2.8 9225 3.0 4360 47.3 1.4 9430 3.1 306240 13165 4.3 294790 40710 13.8 313267 306517 2.2 62495 19.9 56040 17.9 152620 160650 1.0 96710 75.9 130975 24880 11.9
3 1012 1012 Central Regional Health Authority 4.1 92690 43275 46.7 0 0.0 0 0.0 31860 34.4 60835 65.6 2.2 30630 3570 11.7 2825 9.2 740 2.4 6.8 91160 3430 2895 84.4 905 31.3 1525 52.7 370 12.8 85 2.9 32955 18250 55.4 14490 79.4 2545 13.9 1220 6.7 41070 6080 14.8 30070 1705 5.7 10295 2535 24.6 90205 7165 7.9 14680 1660 11.3 37795 48390 27521 162012 7510 2985 39.9 31920 2625 8.2 56442 21.9 24.2 5270 5.8 855 0.9 295 34.5 0.3 780 0.9 90575 3995 4.4 88000 11715 13.3 92690 93642 -1.0 16690 18.0 22100 23.8 45450 47240 1.0 20900 63.4 39430 5610 23.1
4 1013 1013 Western Regional Health Authority 4.7 77685 33470 43.1 0 0.0 0 0.0 41830 53.8 35855 46.2 2.5 24655 4065 16.5 3275 13.3 790 3.2 8.1 76520 3315 2980 89.9 845 28.4 1520 51.0 495 16.6 120 4.0 27715 17280 62.3 12855 74.4 3130 18.1 1295 7.5 35455 4485 12.6 24275 1650 6.8 11025 2810 25.5 76525 7185 9.4 12240 1605 13.1 38028 46608 29950 195672 8065 3375 41.9 25995 2665 10.2 55950 21.5 22.9 19510 25.5 1040 1.4 360 34.6 0.5 685 0.9 75995 3820 5.0 73560 10170 13.8 77687 77983 -0.4 13945 18.0 17840 23.0 37710 39975 0.9 18205 65.7 34055 6040 21.2
In [129]:
# comparing sizes of data frames 
print("\nNumber of features of the old data frame is:", rawdf.shape[1], "\nNumber of features of the new data frame is:",  
      df.shape[1] )
Number of features of the old data frame is: 105 
Number of features of the new data frame is: 102

To build data dictionary for dataset we use the source data for short form and long form by loading them into dfshort and dflong and combine them to one data frame dfshortlong. We only keep the columns Census profile and VALUE.

In [130]:
dfshort = pd.read_csv('1710012201_databaseLoadingData.csv')
dfshort = dfshort[{'Census profile','VALUE'}]
dflong = pd.read_csv('1710012301_databaseLoadingData.csv')
dflong = dflong[{'Census indicator profile','VALUE'}]
dflong.columns = ['Census profile','VALUE']
dfshortlong = pd.concat([dfshort, dflong], sort=True)
dfshortlong
Out[130]:
Census profile VALUE
0 Global non-response rate 4.0
1 Total population 35151730.0
2 Census families in private households 9840725.0
3 Lone-parent families 1612805.0
4 Lone-parent families, proportion of census fam... 16.4
5 Female lone-parent families 1262335.0
6 Female lone-parent families, proportion of cen... 78.3
7 Male lone-parent families 350465.0
8 Male lone-parent families, proportion of censu... 21.7
9 Population living within a Census Metropolitan... 31226645.0
10 Population living within a Census Metropolitan... 88.8
11 Large urban population centre population 20938295.0
12 Large urban population centre population, prop... 59.6
13 Medium population centre population 3179190.0
14 Medium population centre population, proportio... 9.0
15 Small population centre population 4458180.0
16 Small population centre population, proportion... 12.7
17 Rural population 6576060.0
18 Rural population, proportion of total population 18.7
19 Population density (population per square kilo... 3.9
0 Global non-response rate 5.1
1 Total population in private households 34460065.0
2 Population aged 25 to 29 2266030.0
3 High school graduates aged 25 to 29 2033665.0
4 High school graduates aged 25 to 29, proportio... 89.7
5 High school graduates only, aged 25 to 29 529485.0
6 High school graduates only, aged 25 to 29, pro... 26.0
7 Post-secondary graduates, below bachelor's deg... 722825.0
8 Post-secondary graduates, below bachelor's deg... 35.5
9 Post-secondary graduates, bachelor's degree, a... 581550.0
10 Post-secondary graduates, bachelor's degree, a... 28.6
11 Post-secondary graduates, above bachelor's deg... 199815.0
12 Post-secondary graduates, above bachelor's deg... 9.8
13 Population aged 25 to 54 14076325.0
14 Post-secondary graduates aged 25 to 54 9562680.0
15 Post-secondary graduates aged 25 to 54, propor... 67.9
16 Post-secondary graduates, below bachelor's deg... 5144695.0
17 Post-secondary graduates, below bachelor's deg... 53.8
18 Post-secondary graduates, bachelor's degree, a... 2967120.0
19 Post-secondary graduates, bachelor's degree, a... 31.0
20 Post-secondary graduates, above bachelor's deg... 1450860.0
21 Post-secondary graduates, above bachelor's deg... 15.2
22 Employed persons aged 25 to 54 years 11323575.0
23 Employment rate, 25 to 54 years 80.4
24 Labour force aged 15 and over 18672470.0
25 Unemployment rate 15 years and over 7.7
26 Long-term unemployed 814645.0
27 Long-term unemployment rate, labour force aged... 4.4
28 Total economic families for income status 9573075.0
29 Economic families in low income before tax in ... 839765.0
30 Prevalence of low income before tax in 2015 fo... 8.8
31 Total persons aged 17 years and under living i... 6821825.0
32 Persons aged 17 years and under living in low ... 961010.0
33 Prevalence of persons aged 17 years and under ... 14.1
34 Total persons 15 years and over not in an econ... 5252515.0
35 Persons 15 years and over not in economic fami... 1729655.0
36 Prevalence of low income before tax in 2015 fo... 32.9
37 Population in private households for income st... 33968190.0
38 Persons in private households in low income be... 4324825.0
39 Prevalence of low income before tax in 2015 fo... 12.7
40 Average total income in 2015 of population 15 ... 47487.0
41 Average total income in 2015 of males 15 years... 56740.0
42 Average total income in 2015 of females 15 yea... 38632.0
43 Average value of dwelling 443058.0
44 Total private non-farm, non-reserve households 13821180.0
45 Households spending 30% or more of household i... 3325950.0
46 Households spending 30% or more of household i... 40.0
47 Tenant-occupied private non-farm, non-reserve ... 4452850.0
48 Tenant-occupied households spending 30% or mor... 1775570.0
49 Proportion of tenant-occupied households spend... 40.0
50 Owner-occupied private non-farm, non-reserve d... 9368330.0
51 Owner households spending 30% or more of house... 1550380.0
52 Proportion of owner households spending 30% or... 16.6
53 Median household income 70332.0
54 Income share held by households whose incomes ... 20.5
55 Government transfer income in 2015, as a propo... 11.7
56 Aboriginal identity population 1673780.0
57 Aboriginal identity population, proportion of ... 4.9
58 Immigrant population 7540830.0
59 Immigrant population, proportion of total popu... 21.9
60 Immigrants who arrived between 2006 and 2016 2268170.0
61 Immigrants who arrived between 2006 and 2016, ... 30.1
62 Immigrants who arrived between 2006 and 2016, ... 6.6
63 Visible minority population 7674580.0
64 Visible minority population, proportion of tot... 22.3
65 Population aged 1 year and over 34091785.0
66 1-year internal migrants 1451000.0
67 1-year internal migrants, proportion of popula... 4.3
68 Population aged 5 years and over 32568565.0
69 5-year internal migrants 4296715.0
70 5-year internal migrants, proportion of popula... 13.2

Now we make a data frame called data_dictionary with three columns: 'Feature', 'Type', 'Description'. "Feature" and "Type" refers to column names and type in HR_2016_Census_simple. We will use the column "Census profile" of dfshortlong as a description for each feature in HR_2016_Census_simple. Note that dfshortlong has only 91 rows, while HR_2016_Census_simple has 105 rows. Then we expect to have some null values in description column: at least (103-91=12) if all data in both datasets match.

In [131]:
dfshortlong = dfshortlong[{'Census profile','VALUE'}]
df1 = rawdf.iloc[0]
data_dictionary = pd.DataFrame(columns=['Feature','Type', 'Description'])
data_dictionary.loc[0] = ['Geocode', 'numeric', 'Geographic code: 1-4 digits']
data_dictionary.loc[1] = ['Geo Name', 'string', 'Geographical names: Canada, provinces and territories and health regions']

for j in range (2, len(df1)): 
    k = 0
    for i in range(0, len(dfshortlong)):
        if df1.values[j] == dfshortlong['VALUE'].values[i]:
            k = 1
            data_dictionary.loc[j] = [df1.index[j], 'numeric', dfshortlong['Census profile'].values[i]]
    if k == 0:
        data_dictionary.loc[j] = [df1.index[j], 'numeric', np.nan]
        
data_dictionary
Out[131]:
Feature Type Description
0 Geocode numeric Geographic code: 1-4 digits
1 Geo Name string Geographical names: Canada, provinces and terr...
2 GNR numeric Global non-response rate
3 POP_2016 numeric Total population
4 POP_MIZ numeric Population living within a Census Metropolitan...
5 POP_MIZ_RATE numeric Population living within a Census Metropolitan...
6 POP_LRG_POP_CNTR numeric Large urban population centre population
7 LRG_POP_CNTR_RATE numeric Large urban population centre population, prop...
8 POP_MED_POP_CNTR numeric Medium population centre population
9 MED_POP_CNTR_RATE numeric Medium population centre population, proportio...
10 POP_SML_POP_CNTRE numeric Small population centre population
11 SML_POP_CNTRE_RATE numeric Prevalence of low income before tax in 2015 fo...
12 POP_RRL numeric Rural population
13 RRL_RATE numeric Rural population, proportion of total population
14 POP_DENSE numeric Population density (population per square kilo...
15 CENSUS_FAM_PRIV_HHLDS numeric Census families in private households
16 LONE numeric Lone-parent families
17 LONE_RATE numeric Lone-parent families, proportion of census fam...
18 LONE_FEMALE numeric Female lone-parent families
19 LONE_FEMALE_RATE numeric NaN
20 LONE_MALE numeric Male lone-parent families
21 LONE_MALE_RATE numeric NaN
22 GNR.1 numeric Global non-response rate
23 POP_PRIV_HHLD_2016 numeric Total population in private households
24 POP_25_29 numeric Population aged 25 to 29
25 HSG numeric High school graduates aged 25 to 29
26 HSG_RATE numeric High school graduates aged 25 to 29, proportio...
27 HSG_ONLY numeric High school graduates only, aged 25 to 29
28 HSG_ONLY_RATE numeric High school graduates only, aged 25 to 29, pro...
29 HSG_PSG_BELOW_BACH numeric Post-secondary graduates, below bachelor's deg...
30 HSG_PSG_BELOW_BACH_RATE numeric Post-secondary graduates, below bachelor's deg...
31 HSG_PSG_BACH numeric Post-secondary graduates, bachelor's degree, a...
32 HSG_PSG_BACH_RATE numeric Post-secondary graduates, bachelor's degree, a...
33 HSG_PSG_ABOVE_BACH numeric Post-secondary graduates, above bachelor's deg...
34 HSG_PSG_ABOVE_BACH_RATE numeric Post-secondary graduates, above bachelor's deg...
35 POP_25_54 numeric Population aged 25 to 54
36 PSG numeric Post-secondary graduates aged 25 to 54
37 PSG_RATE numeric Post-secondary graduates aged 25 to 54, propor...
38 PSG_BELOW_BACH numeric Post-secondary graduates, below bachelor's deg...
39 PSG_BELOW_BACH_RATE numeric Post-secondary graduates, below bachelor's deg...
40 PSG_BACH numeric Post-secondary graduates, bachelor's degree, a...
41 PSG_BACH_RATE numeric Post-secondary graduates, bachelor's degree, a...
42 PSG_ABOVE_BACH numeric Post-secondary graduates, above bachelor's deg...
43 PSG_ABOVE_BACH_RATE numeric Post-secondary graduates, above bachelor's deg...
44 LABOUR numeric Labour force aged 15 and over
45 UE numeric Long-term unemployed
46 UE_RATE numeric Long-term unemployment rate, labour force aged...
47 ECON_FAM_LW_INC_DENOM numeric Total economic families for income status
48 LW_INC_ECON_FAM numeric Economic families in low income before tax in ...
49 LW_INC_ECON_FAM_RATE numeric Prevalence of low income before tax in 2015 fo...
50 UNATTACHED_LW_INC_DENOM numeric Total persons 15 years and over not in an econ...
51 LOW_INC_UNATTACHED numeric Persons 15 years and over not in economic fami...
52 LOW_INC_UNATTACHED_RATE numeric Prevalence of low income before tax in 2015 fo...
53 POP_PRIV_HHLDS_LW_INC_DENOM numeric Population in private households for income st...
54 POP_PRIV_HHLDS_LW_INC numeric Persons in private households in low income be...
55 POP_PRIV_HHLDS_LW_INC_RATE numeric Prevalence of low income before tax in 2015 fo...
56 CHILD17_ECON_FAM_LOW_INC_DENOM numeric Total persons aged 17 years and under living i...
57 CHILD17_ECON_FAM_LOW_INC numeric Persons aged 17 years and under living in low ...
58 CHILD17_ECON_FAM_LOW_INC_RATE numeric Prevalence of persons aged 17 years and under ...
59 AVE_PERS_INC numeric Average total income in 2015 of population 15 ...
60 AVE_PERS_INC_MALE numeric Average total income in 2015 of males 15 years...
61 AVE_PERS_INC_FEMALE numeric Average total income in 2015 of females 15 yea...
62 AVE_DWELL numeric Average value of dwelling
63 RENTER_SHELCO_HHLD numeric Tenant-occupied private non-farm, non-reserve ...
64 RENTER_OVER30 numeric Tenant-occupied households spending 30% or mor...
65 RENTER_OVER30_RATE numeric Proportion of tenant-occupied households spend...
66 OWNER_SHELCO_HHLD numeric Owner-occupied private non-farm, non-reserve d...
67 OWNER_OVER30 numeric Owner households spending 30% or more of house...
68 OWNER_OVER30_RATE numeric Proportion of owner households spending 30% or...
69 MEDIAN_HHLD_INC numeric Median household income
70 INC_SHR_BELOW_MEDIAN numeric Income share held by households whose incomes ...
71 GOVT_TRNSFR_INC numeric Government transfer income in 2015, as a propo...
72 ABO numeric Aboriginal identity population
73 ABO_RATE numeric Aboriginal identity population, proportion of ...
74 IMM numeric Immigrant population
75 IMM_RATE numeric Immigrant population, proportion of total popu...
76 IMM_2006_2016 numeric Immigrants who arrived between 2006 and 2016
77 IMM_2006_2016_RATE_IMM numeric Immigrants who arrived between 2006 and 2016, ...
78 IMM_2006_2016_RATE_TOTPOP numeric Immigrants who arrived between 2006 and 2016, ...
79 VIS_MIN numeric Visible minority population
80 VISMIN_RATE numeric Visible minority population, proportion of tot...
81 POP_ONE_PLUS numeric Population aged 1 year and over
82 ONE_YR_INT_MIG numeric 1-year internal migrants
83 ONE_YR_INT_MIG_RATE numeric 1-year internal migrants, proportion of popula...
84 POP_FIVE_PLUS numeric Population aged 5 years and over
85 FIVE_YR_INT_MIG numeric 5-year internal migrants
86 FIVE_YR_INT_MIG_RATE numeric 5-year internal migrants, proportion of popula...
87 POP16 numeric NaN
88 POP11 numeric NaN
89 GROWTH numeric NaN
90 POP16_2A numeric Total population
91 POP_0_19 numeric NaN
92 PCT_0_19 numeric NaN
93 POP_65_PLUS numeric NaN
94 PCT_65_PLUS numeric NaN
95 MALE numeric NaN
96 FEMALE numeric NaN
97 MF_RATE numeric NaN
98 POP_25_54.1 numeric Population aged 25 to 54
99 EMP_25_54 numeric Employed persons aged 25 to 54 years
100 EMP_RATE_25_54 numeric Employment rate, 25 to 54 years
101 SHELCO_HH numeric Total private non-farm, non-reserve households
102 OVER30_HH numeric Households spending 30% or more of household i...
103 HOU_AFF numeric Proportion of tenant-occupied households spend...
104 UNEMP_RATE numeric Unemployment rate 15 years and over

We check to see if we have any null values:

In [132]:
print(data_dictionary.isnull().values.any())
True

The answer is Yes, then we collect all the rows with empty description column into null_data:

In [133]:
null_data = data_dictionary[data_dictionary.isnull().any(axis=1)]
null_data, len(null_data)
Out[133]:
(             Feature     Type Description
 19  LONE_FEMALE_RATE  numeric         NaN
 21    LONE_MALE_RATE  numeric         NaN
 87             POP16  numeric         NaN
 88             POP11  numeric         NaN
 89            GROWTH  numeric         NaN
 91          POP_0_19  numeric         NaN
 92          PCT_0_19  numeric         NaN
 93       POP_65_PLUS  numeric         NaN
 94       PCT_65_PLUS  numeric         NaN
 95              MALE  numeric         NaN
 96            FEMALE  numeric         NaN
 97           MF_RATE  numeric         NaN, 12)

There are only 12 features with empty description, it seems about right. However, columns LONE_FEMALE_RATE and LONE_MALE_RATE are related to columns LONE_FEMALE and LONE_MALE and the description for those columns has been filled:

In [134]:
data_dictionary[data_dictionary['Feature'] == 'LONE_FEMALE_RATE'],data_dictionary[data_dictionary['Feature'] == 'LONE_MALE']
Out[134]:
(             Feature     Type Description
 19  LONE_FEMALE_RATE  numeric         NaN,
       Feature     Type                Description
 20  LONE_MALE  numeric  Male lone-parent families)

By a closer look at short form and long from we find that usually if there is a rate for a Census profile, it comes after the population for that Census profile. Therefore we can check if those rate exist in the dfshortlong that we have downloaded from statcan website:

In [135]:
dfshortlong.iloc[dfshortlong[dfshortlong['VALUE'] == df1['LONE_FEMALE']].index[0]:dfshortlong[dfshortlong['VALUE'] == df1['LONE_FEMALE']].index[0]+4]
Out[135]:
VALUE Census profile
5 1262335.0 Female lone-parent families
6 78.3 Female lone-parent families, proportion of cen...
7 350465.0 Male lone-parent families
8 21.7 Male lone-parent families, proportion of censu...

It is obvious that the two rates exist in dfshortlong. Now let's have a look at the values that we have for these two features in our dataset HR_2016_Census_simple for Canada:

In [136]:
df1[{'LONE_FEMALE_RATE','LONE_MALE_RATE'}]
Out[136]:
LONE_MALE_RATE       3.6
LONE_FEMALE_RATE    12.8
Name: 0, dtype: object

Evidently, there is a noticeable discrepancy between two datasets for these two features. The dataset we downloaded from statcan website indicates that the proportion of female lone-parent families in Canada is, 78.3% of total census families while this proportion is 21.7% for male. It is obvious the sum of these two proportions is 100%; therefore they cannot be a proportion of census families, rather they are proportion of census lone-parent families. Our dataset though indicates that the proportion of female lone-parent families in Canada is 12.8% of total census families and the proportion for male is 3.6%, that sounds like reasonable. Then we add the description for them in our data_dictionary.

In [137]:
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'LONE_FEMALE_RATE'].index[0]] = ['LONE_FEMALE_RATE', 'numeric', 'Female lone-parent families, proportion of census families']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'LONE_MALE_RATE'].index[0]] = ['LONE_MALE_RATE', 'numeric', 'Male lone-parent families, proportion of census families']
null_data = data_dictionary[data_dictionary.isnull().any(axis=1)]
null_data, len(null_data)
Out[137]:
(        Feature     Type Description
 87        POP16  numeric         NaN
 88        POP11  numeric         NaN
 89       GROWTH  numeric         NaN
 91     POP_0_19  numeric         NaN
 92     PCT_0_19  numeric         NaN
 93  POP_65_PLUS  numeric         NaN
 94  PCT_65_PLUS  numeric         NaN
 95         MALE  numeric         NaN
 96       FEMALE  numeric         NaN
 97      MF_RATE  numeric         NaN, 10)

Now we have 10 features with empty description. Again looking at duplicated values for Canada below, there are couple of more values that are similar but they are for different features. Those columns are: SML_POP_CNTRE_RATE, POP_PRIV_HHLDS_LW_INC_RATE and RENTER_OVER30_RATE, HOU_AFF. We need to make sure we have the right description for them:

In [138]:
df1_num = df1.drop(df1.index[1])
df1_num[df1_num.duplicated(keep=False)]
Out[138]:
Geocode                              1
POP_2016                      35151730
SML_POP_CNTRE_RATE                12.7
POP_25_54                     14076325
POP_PRIV_HHLDS_LW_INC_RATE        12.7
RENTER_OVER30_RATE                  40
POP16_2A                      35151730
MF_RATE                              1
POP_25_54.1                   14076325
HOU_AFF                             40
Name: 0, dtype: object
In [139]:
#from IPython.core.interactiveshell import InteractiveShell
#InteractiveShell.ast_node_interactivity = "all"
data_dictionary[data_dictionary['Feature'] == 'SML_POP_CNTRE_RATE']
data_dictionary[data_dictionary['Feature'] == 'POP_PRIV_HHLDS_LW_INC_RATE']
data_dictionary[data_dictionary['Feature'] == 'RENTER_OVER30_RATE']
data_dictionary[data_dictionary['Feature'] == 'HOU_AFF'] 
Out[139]:
Feature Type Description
103 HOU_AFF numeric Proportion of tenant-occupied households spend...

We notice that description for SML_POP_CNTRE_RATE and HOU_AFF are not correct. We can adjust them and also it is not difficult to find the description for the remaining features:

In [140]:
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'SML_POP_CNTRE_RATE'].index[0]] = ['SML_POP_CNTRE_RATE', 'numeric', 'Small population centre population, proportion of total population']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'HOU_AFF'].index[0]] = ['HOU_AFF', 'numeric', 'Households spending 30% or more of household income on shelter, proportion of total shelter-cost households']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'POP11'].index[0]] = ['POP11', 'numeric', '2011 population']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'GROWTH'].index[0]] = ['GROWTH', 'numeric', '2011 to 2016 population growth (%)']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'POP_0_19'].index[0]] = ['POP_0_19', 'numeric', 'Population aged 0 to 19']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'PCT_0_19'].index[0]] = ['PCT_0_19', 'numeric', 'Population aged 0 to 19, proportion of total population']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'POP_65_PLUS'].index[0]] = ['POP_65_PLUS', 'numeric', 'Populaiton aged 65 and over']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'PCT_65_PLUS'].index[0]] = ['PCT_65_PLUS', 'numeric', 'Populaiton aged 65 and over, proportion of total population']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'MALE'].index[0]] = ['MALE', 'numeric', 'Male population']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'FEMALE'].index[0]] = ['FEMALE', 'numeric', 'Female population']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'MF_RATE'].index[0]] = ['MF_RATE', 'numeric', 'Male to female population rate']
null_data = data_dictionary[data_dictionary.isnull().any(axis=1)]
null_data, len(null_data)
Out[140]:
(   Feature     Type Description
 87   POP16  numeric         NaN, 1)

There is one more feature left but it is again 2016 population. Let's check the original dataset for that column and compare it with column POP_2016:

In [141]:
rawdf[{'POP_2016','POP16'}]
Out[141]:
POP16 POP_2016
0 35151728 35151730
1 519716 519715
2 313267 313270
3 92690 92690
4 77687 77685
5 36072 36070
6 142907 142910
7 142907 142905
8 923598 923600
9 194376 194375
10 146249 146245
11 158936 158935
12 424037 424040
13 747101 747100
14 209256 209255
15 170537 170535
16 174348 174350
17 47775 47775
18 25250 25250
19 76374 76375
20 43561 43560
21 8164361 8164360
22 197385 197385
23 276368 276370
24 729997 729995
25 508511 508515
26 472615 472615
27 1942044 1942045
28 382604 382605
29 146717 146715
30 92518 92515
31 14232 14230
32 90311 90315
33 420082 420080
34 422993 422995
35 494796 494795
36 589400 589400
37 1353459 1353455
38 13188 13185
39 17141 17145
40 13448494 13448490
41 113084 113085
42 134943 134940
43 645862 645865
44 88978 88975
45 161977 161975
46 109652 109650
47 179083 179085
48 548430 548430
49 536917 536920
50 161180 161180
51 59297 59300
52 102042 102045
53 193363 193365
54 126638 126640
55 169244 169245
56 455526 455525
57 447888 447885
58 123820 123820
59 76455 76450
60 934243 934245
61 110862 110865
62 1381744 1381745
63 76796 76800
64 138236 138235
65 84201 84195
66 103593 103590
67 202762 202760
68 540249 540245
69 196448 196445
70 151884 151890
71 33049 33050
72 535154 535155
73 284461 284460
74 398953 398955
75 1109909 1109910
76 2731571 2731570
77 1278365 1278365
78 720883 720885
79 165600 165600
80 127601 127610
81 72220 72220
82 192061 192050
83 1098352 1098350
84 56230 56220
85 53621 53615
86 42967 42965
87 278281 278270
88 55094 55110
89 344458 344515
90 41890 41880
91 39574 39565
92 75637 75605
93 75147 75145
94 22282 22280
95 10539 10540
96 2632 2630
97 35453 35455
98 4067175 4067175
99 291112 291110
100 1551876 1551875
101 461553 461555
102 1320798 1320800
103 441836 441835
104 4648055 4648055
105 79856 79855
106 78463 78465
107 362258 362260
108 219467 219470
109 295763 295755
110 639245 639245
111 784977 784975
112 198309 198310
113 649028 649025
114 284389 284400
115 383360 383360
116 270817 270815
117 122233 122235
118 71553 71545
119 140452 140450
120 67885 67885
121 35874 35875
122 35874 35875
123 41786 41785
124 41786 41785
125 35944 35940
126 35944 35945

There is a minor difference, we ignore it and drop column POP16 from df and data_dictionary:

In [142]:
data_dictionary = data_dictionary.drop([data_dictionary[data_dictionary['Feature'] == 'POP16'].index[0]], axis = 0)
df = df.drop(['POP16'], axis=1)
df.head()
Out[142]:
Geocode Geo Name GNR POP_2016 POP_MIZ POP_MIZ_RATE POP_LRG_POP_CNTR LRG_POP_CNTR_RATE POP_MED_POP_CNTR MED_POP_CNTR_RATE POP_SML_POP_CNTRE SML_POP_CNTRE_RATE POP_RRL RRL_RATE POP_DENSE CENSUS_FAM_PRIV_HHLDS LONE LONE_RATE LONE_FEMALE LONE_FEMALE_RATE LONE_MALE LONE_MALE_RATE GNR.1 POP_PRIV_HHLD_2016 POP_25_29 HSG HSG_RATE HSG_ONLY HSG_ONLY_RATE HSG_PSG_BELOW_BACH HSG_PSG_BELOW_BACH_RATE HSG_PSG_BACH HSG_PSG_BACH_RATE HSG_PSG_ABOVE_BACH HSG_PSG_ABOVE_BACH_RATE POP_25_54 PSG PSG_RATE PSG_BELOW_BACH PSG_BELOW_BACH_RATE PSG_BACH PSG_BACH_RATE PSG_ABOVE_BACH PSG_ABOVE_BACH_RATE LABOUR UE UE_RATE ECON_FAM_LW_INC_DENOM LW_INC_ECON_FAM LW_INC_ECON_FAM_RATE UNATTACHED_LW_INC_DENOM LOW_INC_UNATTACHED LOW_INC_UNATTACHED_RATE POP_PRIV_HHLDS_LW_INC_DENOM POP_PRIV_HHLDS_LW_INC POP_PRIV_HHLDS_LW_INC_RATE CHILD17_ECON_FAM_LOW_INC_DENOM CHILD17_ECON_FAM_LOW_INC CHILD17_ECON_FAM_LOW_INC_RATE AVE_PERS_INC AVE_PERS_INC_MALE AVE_PERS_INC_FEMALE AVE_DWELL RENTER_SHELCO_HHLD RENTER_OVER30 RENTER_OVER30_RATE OWNER_SHELCO_HHLD OWNER_OVER30 OWNER_OVER30_RATE MEDIAN_HHLD_INC INC_SHR_BELOW_MEDIAN GOVT_TRNSFR_INC ABO ABO_RATE IMM IMM_RATE IMM_2006_2016 IMM_2006_2016_RATE_IMM IMM_2006_2016_RATE_TOTPOP VIS_MIN VISMIN_RATE POP_ONE_PLUS ONE_YR_INT_MIG ONE_YR_INT_MIG_RATE POP_FIVE_PLUS FIVE_YR_INT_MIG FIVE_YR_INT_MIG_RATE POP11 GROWTH POP_0_19 PCT_0_19 POP_65_PLUS PCT_65_PLUS MALE FEMALE MF_RATE EMP_25_54 EMP_RATE_25_54 SHELCO_HH OVER30_HH UNEMP_RATE
0 1 Canada 4.0 35151730 31226645 88.8 20938295 59.6 3179190 9.0 4458180 12.7 6576060 18.7 3.9 9840725 1612805 16.4 1262335 12.8 350465 3.6 5.1 34460065 2266030 2033665 89.7 529485 26.0 722825 35.5 581550 28.6 199815 9.8 14076325 9562680 67.9 5144695 53.8 2967120 31.0 1450860 15.2 18672470 814645 4.4 9573075 839765 8.8 5252515 1729655 32.9 33968190 4324825 12.7 6821825 961010 14.1 47487 56740 38632 443058 4452850 1775570 40.0 9368330 1550380 16.6 70332 20.5 11.7 1673780 4.9 7540830 21.9 2268170 30.1 6.6 7674580 22.3 34091785 1451000 4.3 32568565 4296715 13.2 33476688 5.0 7865725 22.4 5935630 16.9 17264200 17887530 1.0 11323575 80.4 13821180 3325950 7.7
1 10 Newfoundland and Labrador / Terre-Neuve-et-Lab... 4.0 519715 312530 60.1 178430 34.3 0 0.0 123290 23.7 218000 41.9 1.4 161005 24250 15.1 19290 12.0 4955 3.1 6.8 512250 28130 25455 90.5 6375 25.0 11515 45.2 5950 23.4 1615 6.3 203385 134670 66.2 91545 68.0 28820 21.4 14305 10.6 256855 24455 9.5 158975 9415 5.9 69330 19125 27.6 509340 45045 8.8 89210 10240 11.5 45210 56724 34259 243157 50415 19810 39.4 167335 17695 10.6 67320 24.2 16.5 45725 8.9 12075 2.4 5540 45.9 1.1 11815 2.3 507965 22185 4.4 489800 66305 13.5 514536 1.0 101695 19.6 101030 19.4 253925 265790 1.0 146440 72.0 217750 37500 15.6
2 1011 1011 Eastern Regional Health Authority 3.7 313270 235790 75.3 178425 57.0 0 0.0 32520 10.4 102325 32.7 16.2 94510 15045 15.9 12025 12.7 3020 3.2 6.5 309015 19165 17785 92.8 4175 23.5 7470 42.0 4780 26.9 1360 7.6 127455 89625 70.3 56775 63.3 21620 24.1 11225 12.5 161620 11245 7.0 94065 5740 6.1 44770 13255 29.6 309015 29230 9.5 55415 6570 11.9 48517 60868 36888 285746 31990 13000 40.8 98985 11875 12.0 72211 20.3 13.8 8635 2.8 9225 3.0 4360 47.3 1.4 9430 3.1 306240 13165 4.3 294790 40710 13.8 306517 2.2 62495 19.9 56040 17.9 152620 160650 1.0 96710 75.9 130975 24880 11.9
3 1012 1012 Central Regional Health Authority 4.1 92690 43275 46.7 0 0.0 0 0.0 31860 34.4 60835 65.6 2.2 30630 3570 11.7 2825 9.2 740 2.4 6.8 91160 3430 2895 84.4 905 31.3 1525 52.7 370 12.8 85 2.9 32955 18250 55.4 14490 79.4 2545 13.9 1220 6.7 41070 6080 14.8 30070 1705 5.7 10295 2535 24.6 90205 7165 7.9 14680 1660 11.3 37795 48390 27521 162012 7510 2985 39.9 31920 2625 8.2 56442 21.9 24.2 5270 5.8 855 0.9 295 34.5 0.3 780 0.9 90575 3995 4.4 88000 11715 13.3 93642 -1.0 16690 18.0 22100 23.8 45450 47240 1.0 20900 63.4 39430 5610 23.1
4 1013 1013 Western Regional Health Authority 4.7 77685 33470 43.1 0 0.0 0 0.0 41830 53.8 35855 46.2 2.5 24655 4065 16.5 3275 13.3 790 3.2 8.1 76520 3315 2980 89.9 845 28.4 1520 51.0 495 16.6 120 4.0 27715 17280 62.3 12855 74.4 3130 18.1 1295 7.5 35455 4485 12.6 24275 1650 6.8 11025 2810 25.5 76525 7185 9.4 12240 1605 13.1 38028 46608 29950 195672 8065 3375 41.9 25995 2665 10.2 55950 21.5 22.9 19510 25.5 1040 1.4 360 34.6 0.5 685 0.9 75995 3820 5.0 73560 10170 13.8 77983 -0.4 13945 18.0 17840 23.0 37710 39975 0.9 18205 65.7 34055 6040 21.2

Finally our data dictionary is complete and it describes all the features in data frame:

In [143]:
data_dictionary
Out[143]:
Feature Type Description
0 Geocode numeric Geographic code: 1-4 digits
1 Geo Name string Geographical names: Canada, provinces and terr...
2 GNR numeric Global non-response rate
3 POP_2016 numeric Total population
4 POP_MIZ numeric Population living within a Census Metropolitan...
5 POP_MIZ_RATE numeric Population living within a Census Metropolitan...
6 POP_LRG_POP_CNTR numeric Large urban population centre population
7 LRG_POP_CNTR_RATE numeric Large urban population centre population, prop...
8 POP_MED_POP_CNTR numeric Medium population centre population
9 MED_POP_CNTR_RATE numeric Medium population centre population, proportio...
10 POP_SML_POP_CNTRE numeric Small population centre population
11 SML_POP_CNTRE_RATE numeric Small population centre population, proportion...
12 POP_RRL numeric Rural population
13 RRL_RATE numeric Rural population, proportion of total population
14 POP_DENSE numeric Population density (population per square kilo...
15 CENSUS_FAM_PRIV_HHLDS numeric Census families in private households
16 LONE numeric Lone-parent families
17 LONE_RATE numeric Lone-parent families, proportion of census fam...
18 LONE_FEMALE numeric Female lone-parent families
19 LONE_FEMALE_RATE numeric Female lone-parent families, proportion of cen...
20 LONE_MALE numeric Male lone-parent families
21 LONE_MALE_RATE numeric Male lone-parent families, proportion of censu...
22 GNR.1 numeric Global non-response rate
23 POP_PRIV_HHLD_2016 numeric Total population in private households
24 POP_25_29 numeric Population aged 25 to 29
25 HSG numeric High school graduates aged 25 to 29
26 HSG_RATE numeric High school graduates aged 25 to 29, proportio...
27 HSG_ONLY numeric High school graduates only, aged 25 to 29
28 HSG_ONLY_RATE numeric High school graduates only, aged 25 to 29, pro...
29 HSG_PSG_BELOW_BACH numeric Post-secondary graduates, below bachelor's deg...
30 HSG_PSG_BELOW_BACH_RATE numeric Post-secondary graduates, below bachelor's deg...
31 HSG_PSG_BACH numeric Post-secondary graduates, bachelor's degree, a...
32 HSG_PSG_BACH_RATE numeric Post-secondary graduates, bachelor's degree, a...
33 HSG_PSG_ABOVE_BACH numeric Post-secondary graduates, above bachelor's deg...
34 HSG_PSG_ABOVE_BACH_RATE numeric Post-secondary graduates, above bachelor's deg...
35 POP_25_54 numeric Population aged 25 to 54
36 PSG numeric Post-secondary graduates aged 25 to 54
37 PSG_RATE numeric Post-secondary graduates aged 25 to 54, propor...
38 PSG_BELOW_BACH numeric Post-secondary graduates, below bachelor's deg...
39 PSG_BELOW_BACH_RATE numeric Post-secondary graduates, below bachelor's deg...
40 PSG_BACH numeric Post-secondary graduates, bachelor's degree, a...
41 PSG_BACH_RATE numeric Post-secondary graduates, bachelor's degree, a...
42 PSG_ABOVE_BACH numeric Post-secondary graduates, above bachelor's deg...
43 PSG_ABOVE_BACH_RATE numeric Post-secondary graduates, above bachelor's deg...
44 LABOUR numeric Labour force aged 15 and over
45 UE numeric Long-term unemployed
46 UE_RATE numeric Long-term unemployment rate, labour force aged...
47 ECON_FAM_LW_INC_DENOM numeric Total economic families for income status
48 LW_INC_ECON_FAM numeric Economic families in low income before tax in ...
49 LW_INC_ECON_FAM_RATE numeric Prevalence of low income before tax in 2015 fo...
50 UNATTACHED_LW_INC_DENOM numeric Total persons 15 years and over not in an econ...
51 LOW_INC_UNATTACHED numeric Persons 15 years and over not in economic fami...
52 LOW_INC_UNATTACHED_RATE numeric Prevalence of low income before tax in 2015 fo...
53 POP_PRIV_HHLDS_LW_INC_DENOM numeric Population in private households for income st...
54 POP_PRIV_HHLDS_LW_INC numeric Persons in private households in low income be...
55 POP_PRIV_HHLDS_LW_INC_RATE numeric Prevalence of low income before tax in 2015 fo...
56 CHILD17_ECON_FAM_LOW_INC_DENOM numeric Total persons aged 17 years and under living i...
57 CHILD17_ECON_FAM_LOW_INC numeric Persons aged 17 years and under living in low ...
58 CHILD17_ECON_FAM_LOW_INC_RATE numeric Prevalence of persons aged 17 years and under ...
59 AVE_PERS_INC numeric Average total income in 2015 of population 15 ...
60 AVE_PERS_INC_MALE numeric Average total income in 2015 of males 15 years...
61 AVE_PERS_INC_FEMALE numeric Average total income in 2015 of females 15 yea...
62 AVE_DWELL numeric Average value of dwelling
63 RENTER_SHELCO_HHLD numeric Tenant-occupied private non-farm, non-reserve ...
64 RENTER_OVER30 numeric Tenant-occupied households spending 30% or mor...
65 RENTER_OVER30_RATE numeric Proportion of tenant-occupied households spend...
66 OWNER_SHELCO_HHLD numeric Owner-occupied private non-farm, non-reserve d...
67 OWNER_OVER30 numeric Owner households spending 30% or more of house...
68 OWNER_OVER30_RATE numeric Proportion of owner households spending 30% or...
69 MEDIAN_HHLD_INC numeric Median household income
70 INC_SHR_BELOW_MEDIAN numeric Income share held by households whose incomes ...
71 GOVT_TRNSFR_INC numeric Government transfer income in 2015, as a propo...
72 ABO numeric Aboriginal identity population
73 ABO_RATE numeric Aboriginal identity population, proportion of ...
74 IMM numeric Immigrant population
75 IMM_RATE numeric Immigrant population, proportion of total popu...
76 IMM_2006_2016 numeric Immigrants who arrived between 2006 and 2016
77 IMM_2006_2016_RATE_IMM numeric Immigrants who arrived between 2006 and 2016, ...
78 IMM_2006_2016_RATE_TOTPOP numeric Immigrants who arrived between 2006 and 2016, ...
79 VIS_MIN numeric Visible minority population
80 VISMIN_RATE numeric Visible minority population, proportion of tot...
81 POP_ONE_PLUS numeric Population aged 1 year and over
82 ONE_YR_INT_MIG numeric 1-year internal migrants
83 ONE_YR_INT_MIG_RATE numeric 1-year internal migrants, proportion of popula...
84 POP_FIVE_PLUS numeric Population aged 5 years and over
85 FIVE_YR_INT_MIG numeric 5-year internal migrants
86 FIVE_YR_INT_MIG_RATE numeric 5-year internal migrants, proportion of popula...
88 POP11 numeric 2011 population
89 GROWTH numeric 2011 to 2016 population growth (%)
90 POP16_2A numeric Total population
91 POP_0_19 numeric Population aged 0 to 19
92 PCT_0_19 numeric Population aged 0 to 19, proportion of total p...
93 POP_65_PLUS numeric Populaiton aged 65 and over
94 PCT_65_PLUS numeric Populaiton aged 65 and over, proportion of tot...
95 MALE numeric Male population
96 FEMALE numeric Female population
97 MF_RATE numeric Male to female population rate
98 POP_25_54.1 numeric Population aged 25 to 54
99 EMP_25_54 numeric Employed persons aged 25 to 54 years
100 EMP_RATE_25_54 numeric Employment rate, 25 to 54 years
101 SHELCO_HH numeric Total private non-farm, non-reserve households
102 OVER30_HH numeric Households spending 30% or more of household i...
103 HOU_AFF numeric Households spending 30% or more of household i...
104 UNEMP_RATE numeric Unemployment rate 15 years and over

In making data dictionary we came up with columns MALE, FEMALE and MF_RATE that did not exist in short form or long form that we downloaded from the website. Let's have a closer look:

In [144]:
rawdf[{'MALE','FEMALE', 'MF_RATE'}]
Out[144]:
MF_RATE FEMALE MALE
0 1.0 17887530 17264200
1 1.0 265790 253925
2 1.0 160650 152620
3 1.0 47240 45450
4 0.9 39975 37710
5 1.0 17925 18145
6 0.9 73605 69305
7 0.9 73600 69305
8 0.9 476715 446880
9 0.9 99950 94430
10 0.9 75330 70915
11 0.9 82750 76185
12 0.9 218685 205350
13 1.0 381745 365360
14 1.0 106700 102555
15 0.9 88080 82455
16 1.0 88680 85670
17 1.0 24360 23415
18 0.9 12990 12260
19 1.0 38685 37690
20 1.0 22250 21315
21 1.0 4147605 4016760
22 1.0 99885 97500
23 1.0 138400 137965
24 1.0 372790 357205
25 1.0 256960 251550
26 1.0 238120 234490
27 0.9 997560 944485
28 1.0 194410 188195
29 1.0 72820 73900
30 1.0 45855 46660
31 1.1 6750 7475
32 1.0 46120 44190
33 1.0 210135 209950
34 0.9 217485 205505
35 1.0 249480 245315
36 1.0 297340 292055
37 1.0 688445 665010
38 1.0 6500 6690
39 1.0 8540 8605
40 1.0 6889105 6559390
41 1.0 57820 55270
42 1.0 69165 65780
43 1.0 331190 314675
44 1.0 45240 43735
45 1.0 82255 79720
46 1.0 55125 54525
47 1.0 90760 88325
48 0.9 282085 266350
49 1.0 274390 262525
50 1.0 82180 79005
51 1.0 29940 29360
52 1.0 52295 49750
53 1.0 98865 94500
54 1.0 64640 61995
55 1.0 86745 82500
56 0.9 234295 221235
57 0.9 230970 216915
58 1.0 62935 60890
59 1.0 38260 38190
60 0.9 480370 453880
61 1.0 56115 54745
62 1.0 703045 678700
63 1.0 39175 37620
64 0.9 71460 66775
65 1.0 42240 41955
66 1.0 51865 51730
67 1.0 103080 99685
68 1.0 274225 266025
69 1.0 99860 96585
70 1.0 76685 75200
71 1.0 16680 16365
72 1.0 270885 264265
73 1.0 144450 140015
74 1.0 202695 196260
75 1.0 569150 540760
76 0.9 1417980 1313585
77 1.0 646970 631395
78 1.0 368665 352215
79 1.0 83880 81720
80 1.0 63010 64595
81 1.0 35765 36455
82 1.0 95650 96405
83 1.0 552570 545780
84 1.0 27860 28365
85 1.0 26980 26635
86 1.0 21540 21430
87 1.0 140405 137860
88 1.0 27775 27340
89 1.0 173815 170700
90 1.0 20755 21125
91 1.0 19930 19640
92 1.0 37945 37660
93 1.0 37920 37225
94 1.0 10995 11280
95 1.0 5355 5185
96 1.0 1305 1330
97 1.0 17660 17795
98 1.0 2027765 2039405
99 1.0 146335 144775
100 1.0 776940 774935
101 1.0 230035 231520
102 1.0 660610 660180
103 1.1 213840 227995
104 1.0 2369815 2278240
105 1.0 39960 39895
106 1.0 39335 39130
107 0.9 186740 175515
108 1.0 110265 109205
109 1.0 149220 146535
110 1.0 326245 313005
111 1.0 398985 385990
112 0.9 103650 94655
113 1.0 332115 316915
114 0.9 146420 137980
115 0.9 198460 184900
116 1.0 138740 132075
117 1.0 62175 60060
118 1.0 35125 36420
119 1.0 69370 71085
120 1.1 33005 34885
121 1.0 17795 18080
122 1.0 17800 18080
123 1.0 20505 21275
124 1.0 20510 21275
125 1.0 17550 18395
126 1.0 17555 18395

It seems female and male population are almost equal in all regions and their rate is almost 1, let's count:

In [145]:
df['MF_RATE'].value_counts()
Out[145]:
1.0    102
0.9     22
1.1      3
Name: MF_RATE, dtype: int64

Often the rate is 1 or very close to 1, then we can drop these three columns from df:

In [146]:
df = df.drop(['MALE','FEMALE','MF_RATE'], axis=1)
In [147]:
print("\nNumber of features of the old data frame 'rawdf' is:", rawdf.shape[1], "\nNumber of features of the new data frame 'df' is:",  
      df.shape[1] )
Number of features of the old data frame 'rawdf' is: 105 
Number of features of the new data frame 'df' is: 98
  • The main goal : Understanding data by exploring and trying to spot unlikely and irregular patterns.

  • What are the features?

  • What type of features do we have: numerical, categorical, or both?
  • Is there missing data or null values?
  • Is there a correlation between the features?
  • Are there duplicated features?
  • Do we need all the data to perform proper analysis, or do we need a subset to ensure faster performance?
  • Is the data complete? Accurate? Up to date?
  • In its current state, can we use the data to answer our questions?
  • If there are inconsistencies or redundant values, what do we need to do to clean the data? Is it a matter of manually changing a few values or will a more systematic approach be necessary?
  • Do we need to further consolidate the data? In other words, could we pick a subset of data for the types of analysis we want to perform?

6. Feature Selection

So far we dropped duplicated and some other columns from dataset df. Yet we have too many features in this dataset. In this section the goal is to select the "right" features for future analysis. A practical way is looking at the pairwise correlation between features. Correlation is a measure of how close two variables are to having a linear relationship with each other. When the correlation between two features is high, those two features are more linearly dependent and hence have almost the same effect on the dependent variable. So, when two features have high correlation, we can drop one of them. Although we haven't specified a target variable, yet it is helpful to investigate pairwise correlation.

It is easy to distinguish two types of features in this dataset: those that represent rate, and those that are population or others, we call them non-rate. A quick way for primary division is pick columns containing RATE or PCT in their name. We also include column GROWTH which is essentially rate. We also exclude Geo Name from non-rate features which is a string, we still have Geocode to connect with Geo Name.

In [148]:
# collecting rate features in df_rate
df.rename(columns={'GROWTH':'GROWTH_RATE','PCT_0_19':'POP_0_19_RATE', 'PCT_65_PLUS':'POP_65_RATE'}, inplace=True)
df_rate = df.loc[:,df.columns.str.contains('RATE')]
# collecting non-rate features in df_nrate, excluding Geo Name which is a string, we still have Geocode to connect with Geo Name
df_nrate = df.loc[:,~df.columns.str.contains('RATE')]
df_nrate = df_nrate.drop(['Geo Name'], axis = 1)
print("\nNumber of features in rate category is:", df_rate.shape[1], "\nNumber of features in non-rate category is:",  
      df_nrate.shape[1] )
Number of features in rate category is: 36 
Number of features in non-rate category is: 61

First let's have a look at the pairwise correlation between features in non-rate category:

In [149]:
corr_nrate = df_nrate.corr()
sns.heatmap(corr_nrate)
    
plt.title('Correlation between different features in Non-rate category')
Out[149]:
Text(0.5, 1, 'Correlation between different features in Non-rate category')

In the heat map above, very bright cells indicating highly correlated features. We compare the correlation between features and remove one of two features that have a correlation higher than 0.9:

In [150]:
columns = np.full((corr_nrate.shape[0],), True, dtype=bool)
for i in range(corr_nrate.shape[0]):
    for j in range(i+1, corr_nrate.shape[0]):
        if corr_nrate.iloc[i,j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = df_nrate.columns[columns]
df_nrate = df_nrate[selected_columns]
df_nrate
Out[150]:
Geocode GNR POP_2016 POP_DENSE GNR.1 AVE_PERS_INC AVE_PERS_INC_FEMALE AVE_DWELL MEDIAN_HHLD_INC INC_SHR_BELOW_MEDIAN GOVT_TRNSFR_INC
0 1 4.0 35151730 3.9 5.1 47487 38632 443058 70332 20.5 11.7
1 10 4.0 519715 1.4 6.8 45210 34259 243157 67320 24.2 16.5
2 1011 3.7 313270 16.2 6.5 48517 36888 285746 72211 20.3 13.8
3 1012 4.1 92690 2.2 6.8 37795 27521 162012 56442 21.9 24.2
4 1013 4.7 77685 2.5 8.1 38028 29950 195672 55950 21.5 22.9
5 1014 4.7 36070 0.1 6.9 51808 38477 205671 94659 23.7 12.8
6 11 4.1 142910 25.1 5.6 38899 34171 197966 61026 23.2 17.6
7 1100 4.1 142905 25.1 5.6 38899 34171 197966 61026 23.2 17.6
8 12 3.9 923600 17.4 5.5 41479 34132 230441 60745 30.5 15.3
9 1201 4.1 194375 11.0 6.5 36861 29585 196741 53832 22.6 19.5
10 1202 3.9 146245 11.8 5.9 38836 31348 177131 56018 21.9 17.8
11 1203 4.2 158935 10.0 6.2 37425 31464 165757 54833 21.6 23.1
12 1204 3.7 424040 61.4 4.7 46070 38196 302104 68950 21.6 10.6
13 13 3.9 747100 10.5 4.9 39141 32504 170071 59313 26.3 16.9
14 1301 3.9 209255 20.9 4.7 39683 34137 183055 61246 23.3 15.6
15 1302 3.5 170535 16.4 5.0 41459 33258 191723 61609 21.8 14.8
16 1303 3.9 174350 7.5 4.5 40351 33666 183356 62196 23.2 14.4
17 1304 4.1 47775 6.1 4.9 35746 29788 138563 53333 23.1 19.5
18 1305 4.5 25250 4.6 6.3 34676 29167 110786 48647 22.1 23.7
19 1306 4.1 76375 16.3 6.3 35671 28365 128231 52826 22.0 24.5
20 1307 3.8 43560 4.4 4.0 35710 29576 132839 55403 23.2 23.8
21 24 3.7 8164360 6.0 4.3 42546 36203 290484 59808 27.6 15.6
22 2401 4.1 197385 8.9 5.0 36525 31474 168853 52285 23.5 21.5
23 2402 3.9 276370 2.8 4.8 39565 31558 196031 58024 23.3 18.8
24 2403 4.1 729995 38.8 4.3 44574 38432 290478 62859 22.8 14.4
25 2404 4.3 508515 11.9 4.5 36971 31141 177975 51102 22.6 20.1
26 2405 4.1 472615 37.1 4.7 38763 33599 244116 54421 22.5 18.4
27 2406 3.8 1942045 3889.8 4.4 43670 37410 462244 52516 18.4 14.1
28 2407 3.6 382605 12.4 4.7 44868 41064 268365 67356 22.8 14.1
29 2408 4.5 146715 2.5 5.9 43402 32978 203355 61283 21.7 16.5
30 2409 5.7 92515 0.4 8.8 44199 34248 171374 66151 22.8 17.0
31 2410 7.0 14230 0.0 4.8 46759 35116 152461 72590 24.4 14.3
32 2411 4.2 90315 4.4 5.7 34666 30991 154609 52160 23.4 28.3
33 2412 3.7 420080 27.9 4.3 40838 34454 212614 61525 23.9 16.8
34 2413 2.9 422995 1710.9 3.0 42879 37664 340865 70077 23.4 14.4
35 2414 3.0 494795 39.8 4.2 40842 34676 250753 65597 24.2 16.9
36 2415 3.3 589400 28.4 4.4 43150 35874 280498 63924 22.6 15.5
37 2416 3.1 1353455 157.3 3.4 45298 37820 301439 68627 22.8 14.1
38 2417 12.8 13185 0.0 13.4 40228 42173 388423 80619 25.2 17.9
39 2418 10.4 17145 3.1 11.3 38076 40287 0 97335 28.9 24.0
40 35 3.7 13448490 14.8 4.6 47915 39585 506409 74260 22.4 11.1
41 3526 4.8 113085 2.7 6.5 40705 33672 213102 58376 21.8 18.9
42 3527 4.2 134940 119.5 5.4 42000 35122 337344 68669 23.2 14.8
43 3530 3.0 645865 255.9 4.0 49018 40434 503610 89792 23.9 10.2
44 3531 3.7 88975 47.3 4.5 40341 34205 258635 66482 24.1 15.8
45 3533 5.4 161975 18.8 6.2 45147 35522 336645 66118 21.3 14.5
46 3534 4.5 109650 38.4 5.4 41749 34525 313599 70721 24.1 15.5
47 3535 5.0 179085 19.8 5.6 42599 34987 344823 67475 23.0 16.6
48 3536 2.7 548430 568.9 3.2 64762 48449 715422 103184 20.7 6.7
49 3537 3.8 536920 480.5 4.9 43966 37103 430555 69110 21.3 13.3
50 3538 4.5 161180 22.5 5.8 39629 33594 271210 61801 23.3 17.9
51 3539 5.0 59300 17.4 6.4 41536 35197 287632 65978 23.5 15.5
52 3540 4.3 102045 41.3 4.9 39021 33385 191420 58147 22.6 17.9
53 3541 4.7 193365 29.2 4.4 45891 39406 340077 69914 22.2 13.2
54 3542 4.2 126640 42.2 4.6 48249 35485 260674 69927 21.3 13.7
55 3543 4.5 169245 26.4 6.2 43410 36917 310018 69568 23.5 15.3
56 3544 3.5 455525 137.3 4.4 44431 38423 307098 64787 20.9 12.6
57 3546 4.0 447885 241.5 5.0 41858 34827 323219 65141 22.3 15.4
58 3547 5.4 123820 7.3 6.9 41119 34447 292932 60463 21.9 17.5
59 3549 7.8 76450 0.4 8.8 42726 36291 230241 67669 22.2 15.5
60 3551 3.5 934245 334.8 4.0 54605 47019 444589 85969 21.8 8.5
61 3552 3.7 110865 54.4 4.7 43553 35968 301723 72345 24.3 13.4
62 3553 2.5 1381745 1108.1 3.8 42651 36287 618409 86314 23.4 10.4
63 3554 3.9 76800 34.6 3.7 43635 36643 333753 70355 23.7 12.5
64 3555 4.6 138235 35.9 5.1 41943 35516 339461 64511 21.7 15.8
65 3556 6.2 84195 0.3 7.8 44832 34942 191925 67897 21.6 14.8
66 3557 5.1 103590 6.9 5.3 42424 34761 264968 67503 24.0 15.4
67 3558 4.2 202760 38.2 5.0 41932 36531 265597 66931 22.9 15.2
68 3560 4.2 540245 61.4 4.7 45017 36944 416454 75296 23.1 13.0
69 3561 4.6 196445 4.2 6.2 46788 37276 285738 69320 21.4 13.6
70 3562 5.9 151890 0.7 6.4 44031 36755 241907 67548 22.2 15.0
71 3563 5.2 33050 2.3 7.0 41425 32257 200382 58790 21.3 17.5
72 3565 3.6 535155 390.9 4.4 46906 38126 395678 77566 22.5 10.4
73 3566 3.8 284460 68.6 4.3 48585 39933 462894 83118 23.0 10.0
74 3568 4.2 398955 215.5 5.1 44139 35650 246774 66815 21.2 14.0
75 3570 2.4 1109910 629.9 3.5 50867 41534 871831 95773 21.1 8.4
76 3595 3.8 2731570 4334.4 4.9 52268 42807 754015 65808 16.4 9.3
77 46 4.3 1278365 2.3 5.3 43767 36747 297444 68070 25.7 11.9
78 4601 3.3 720885 1112.3 3.9 45436 38431 322679 68955 21.6 10.9
79 4602 5.0 165600 2.6 6.6 41671 35378 227861 61613 22.0 14.0
80 4603 5.5 127610 1.6 7.2 43166 35508 299416 70323 22.6 13.6
81 4604 9.7 72220 0.2 12.1 37539 32418 196929 65829 20.6 15.9
82 4605 4.4 192050 7.1 6.2 41284 33157 276578 69541 24.0 12.3
83 47 4.8 1098350 1.9 6.3 49409 40184 318917 75388 21.6 10.5
84 4701 4.5 56220 1.7 6.6 56970 40790 281444 82781 21.5 8.2
85 4702 4.7 53615 2.0 6.7 46462 37212 255753 68412 22.6 12.1
86 4703 6.0 42965 1.0 6.5 46759 38117 241633 70561 22.8 11.1
87 4704 4.1 278270 10.8 5.6 52820 44213 359381 80830 22.1 9.1
88 4705 4.9 55110 2.3 6.4 44035 35701 230256 61594 20.7 14.1
89 4706 4.6 344515 10.7 5.3 51962 41476 369323 80650 21.9 9.0
90 4707 4.9 41880 1.0 7.4 49279 39440 220827 75603 23.0 10.4
91 4708 5.3 39565 0.9 7.3 40589 33093 199883 59522 21.3 16.4
92 4709 5.9 75605 2.5 8.3 42729 37149 278629 65409 21.1 14.9
93 4710 5.1 75145 2.6 8.6 44341 36314 294239 71930 21.7 12.3
94 4711 6.8 22280 0.2 8.4 32772 29370 222750 55995 18.4 18.9
95 4712 7.1 10540 0.1 10.3 31948 31137 175715 47680 18.0 27.3
96 4713 4.6 2630 0.1 5.7 28882 24812 74706 62080 20.1 18.8
97 4714 6.7 35455 0.1 8.8 32241 29598 203663 53050 18.3 21.4
98 48 4.7 4067175 6.4 6.1 62778 45535 449790 93931 20.6 6.7
99 4831 5.1 291110 4.5 6.4 48999 37548 314814 75075 22.2 10.7
100 4832 4.1 1551875 40.0 5.0 68678 49902 527083 98775 19.6 5.5
101 4833 5.1 461555 5.0 6.6 56227 39723 353427 83798 21.2 8.7
102 4834 4.4 1320800 114.0 5.7 60744 44894 437538 94393 21.5 7.0
103 4835 7.0 441835 1.0 10.3 63625 42884 394880 103307 21.5 6.5
104 59 4.9 4648055 5.0 6.1 45616 36901 720689 69979 20.5 11.1
105 5911 6.6 79855 1.8 9.2 46309 34334 355861 71212 22.6 12.9
106 5912 6.7 78465 2.8 9.1 40596 31474 330002 57917 21.7 16.1
107 5913 5.2 362260 17.4 6.5 44395 35250 482594 65770 21.5 14.1
108 5914 6.2 219470 1.9 7.8 42983 33129 355697 66296 22.1 14.7
109 5921 4.9 295755 26.1 5.7 40287 32088 476586 69474 22.5 14.6
110 5922 4.0 639245 226.4 5.1 43940 36119 810935 72993 21.0 9.6
111 5923 4.3 784975 937.8 6.1 43136 34812 772504 79701 22.5 11.1
112 5931 3.4 198310 1534.1 4.3 38039 32680 967385 65368 19.8 10.5
113 5932 5.4 649025 4942.6 6.4 50267 42685 1410609 65076 17.2 7.5
114 5933 5.3 284400 5.5 6.1 60258 45982 1181412 77061 16.9 8.0
115 5941 5.1 383360 163.8 5.0 48004 40608 634182 69642 21.2 10.9
116 5942 5.5 270815 22.2 6.0 41331 33707 418203 62311 22.3 16.5
117 5943 4.8 122235 3.0 5.4 41692 32863 366613 62501 22.8 16.3
118 5951 6.4 71545 0.3 8.1 44573 34847 293261 71591 22.4 13.2
119 5952 4.9 140450 0.8 7.2 46522 34903 265389 74445 22.6 12.0
120 5953 5.5 67885 0.4 9.3 57011 38614 371331 94037 23.2 7.1
121 60 6.7 35875 0.1 6.8 53809 50023 371716 84631 22.6 8.9
122 6001 6.7 35875 0.1 6.8 53809 50023 371716 84631 22.6 8.9
123 61 7.5 41785 0.0 8.8 64586 58918 346427 117723 22.1 7.4
124 6101 7.5 41785 0.0 8.8 64586 58918 346427 117723 22.1 7.4
125 62 7.9 35940 0.0 8.7 50689 50492 352070 97441 20.7 11.9
126 6201 7.9 35945 0.0 8.7 50689 50492 352070 97441 20.7 11.9
In [151]:
df_nrate.shape[1]
Out[151]:
11

Out of 61 features in non-rate category, there are 50 that are highly correlated ( >=.9). Therefore we can reduce the number of features in this category to 11. Actually some of them are proportions or statistics like average or median. Only one is population. That is, all population columns are redundant to 1 which is 2016 population. Now let's check the correlation between features in the rate category:

In [152]:
corr_rate = df_rate.corr()
sns.heatmap(corr_rate)
    
plt.title('Correlation between different features in rate category')
Out[152]:
Text(0.5, 1, 'Correlation between different features in rate category')

From the heat map it seems that there is't that much redundancy in this category. Again, let's remove one of two features that have a correlation higher than 0.9:

In [153]:
columns = np.full((corr_rate.shape[0],), True, dtype=bool)
for i in range(corr_rate.shape[0]):
    for j in range(i+1, corr_rate.shape[0]):
        if corr_rate.iloc[i,j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = df_rate.columns[columns]
df_rate = df_rate[selected_columns]
df_rate
Out[153]:
POP_MIZ_RATE LRG_POP_CNTR_RATE MED_POP_CNTR_RATE SML_POP_CNTRE_RATE RRL_RATE LONE_RATE HSG_RATE HSG_ONLY_RATE HSG_PSG_BELOW_BACH_RATE HSG_PSG_BACH_RATE HSG_PSG_ABOVE_BACH_RATE PSG_RATE PSG_BELOW_BACH_RATE PSG_BACH_RATE PSG_ABOVE_BACH_RATE UE_RATE LW_INC_ECON_FAM_RATE LOW_INC_UNATTACHED_RATE RENTER_OVER30_RATE OWNER_OVER30_RATE IMM_RATE IMM_2006_2016_RATE_IMM IMM_2006_2016_RATE_TOTPOP ONE_YR_INT_MIG_RATE GROWTH_RATE POP_0_19_RATE POP_65_RATE EMP_RATE_25_54
0 88.8 59.6 9.0 12.7 18.7 16.4 89.7 26.0 35.5 28.6 9.8 67.9 53.8 31.0 15.2 4.4 8.8 32.9 40.0 16.6 21.9 30.1 6.6 4.3 5.0 22.4 16.9 80.4
1 60.1 34.3 0.0 23.7 41.9 15.1 90.5 25.0 45.2 23.4 6.3 66.2 68.0 21.4 10.6 9.5 5.9 27.6 39.4 10.6 2.4 45.9 1.1 4.4 1.0 19.6 19.4 72.0
2 75.3 57.0 0.0 10.4 32.7 15.9 92.8 23.5 42.0 26.9 7.6 70.3 63.3 24.1 12.5 7.0 6.1 29.6 40.8 12.0 3.0 47.3 1.4 4.3 2.2 19.9 17.9 75.9
3 46.7 0.0 0.0 34.4 65.6 11.7 84.4 31.3 52.7 12.8 2.9 55.4 79.4 13.9 6.7 14.8 5.7 24.6 39.9 8.2 0.9 34.5 0.3 4.4 -1.0 18.0 23.8 63.4
4 43.1 0.0 0.0 53.8 46.2 16.5 89.9 28.4 51.0 16.6 4.0 62.3 74.4 18.1 7.5 12.6 6.8 25.5 41.9 10.2 1.4 34.6 0.5 5.0 -0.4 18.0 23.0 65.7
5 0.0 0.0 0.0 47.4 52.7 14.0 80.9 24.5 56.0 17.0 2.5 62.4 78.0 16.1 5.9 14.1 3.0 16.2 15.6 5.0 2.7 54.5 1.5 3.4 -0.9 23.7 14.0 69.7
6 77.8 0.0 31.3 13.8 54.9 15.7 93.4 29.9 34.9 28.4 6.9 65.9 59.5 28.5 12.1 7.6 6.0 27.6 36.3 11.1 6.4 52.5 3.4 5.8 1.9 21.9 19.4 79.5
7 77.8 0.0 31.3 13.8 54.9 15.7 93.6 30.0 34.8 28.4 6.8 65.9 59.5 28.5 12.1 7.6 6.0 27.6 36.3 11.1 6.4 52.5 3.4 5.8 1.9 21.9 19.4 79.5
8 73.6 34.3 0.0 23.1 42.6 17.3 92.1 28.5 35.7 28.1 7.7 67.2 57.8 29.2 13.0 5.6 7.0 28.9 42.8 12.0 6.1 35.7 2.2 3.7 0.2 20.0 19.9 78.5
9 23.6 0.0 0.0 28.9 71.1 14.8 88.4 33.3 44.0 18.2 4.5 60.6 70.9 20.2 8.9 5.8 6.3 25.4 40.6 11.9 4.6 23.2 1.1 5.4 -1.5 18.7 24.2 77.0
10 78.7 0.0 0.0 43.4 56.6 17.4 88.5 34.6 43.8 17.8 4.0 62.2 69.8 21.6 8.6 5.9 6.5 26.3 42.4 11.3 3.5 26.3 0.9 4.7 -2.2 20.8 21.7 77.7
11 62.1 0.0 0.0 50.8 49.2 22.4 89.6 31.2 42.0 21.6 5.1 65.3 67.6 22.9 9.5 10.3 6.8 26.8 43.7 10.5 2.6 24.3 0.6 2.5 -2.9 19.6 23.6 68.3
12 99.1 74.7 0.0 3.1 22.3 16.5 94.5 25.4 30.5 34.2 9.9 71.8 47.9 35.7 16.4 4.1 7.7 31.6 43.4 13.0 9.0 41.1 3.7 3.2 3.1 20.6 15.9 82.4
13 72.6 14.5 15.8 18.7 51.0 16.2 91.9 32.6 40.0 22.0 5.4 61.8 62.9 26.9 10.2 6.4 6.9 27.6 36.5 10.3 4.6 41.8 1.9 4.5 -0.5 20.2 19.9 77.9
14 87.2 51.9 0.0 8.2 39.9 15.5 91.2 30.6 41.9 22.4 5.2 64.4 62.3 27.1 10.6 5.7 6.5 27.9 38.7 10.6 5.0 46.8 2.3 5.1 2.7 20.1 19.3 81.3
15 74.3 0.0 34.2 23.0 42.8 17.4 93.2 34.6 38.5 21.8 5.1 61.9 62.8 27.3 9.9 5.5 8.3 28.7 36.2 11.2 5.3 40.4 2.1 4.1 -2.6 21.7 19.0 78.1
16 63.7 0.0 34.1 10.0 55.9 14.9 93.5 34.1 33.4 25.4 7.1 63.8 54.9 32.0 13.1 5.6 7.1 27.3 34.9 9.8 6.4 42.6 2.7 5.6 0.3 22.0 17.9 78.3
17 59.2 0.0 0.0 41.0 59.0 14.4 92.0 35.4 43.1 16.9 4.5 55.4 68.1 23.5 8.4 5.6 5.3 27.6 33.5 11.1 3.4 23.2 0.8 3.3 -2.5 18.6 21.7 79.3
18 87.6 0.0 0.0 44.1 55.9 17.7 91.7 24.1 51.9 19.8 3.7 61.1 75.7 16.8 7.4 9.7 7.8 28.1 35.6 9.2 1.6 34.6 0.6 3.8 -6.2 16.2 25.5 71.5
19 45.0 0.0 0.0 31.4 68.6 17.9 87.6 28.7 52.9 16.2 2.2 56.9 74.0 20.2 5.8 8.9 5.7 27.5 36.4 9.2 1.4 40.5 0.6 3.0 -1.8 16.3 23.9 72.7
20 86.0 0.0 0.0 26.0 74.0 19.0 89.0 36.1 41.8 17.4 4.7 54.9 72.8 21.2 6.1 12.1 5.8 21.7 35.5 9.0 1.8 23.4 0.4 3.1 -2.5 19.1 22.4 68.1
21 88.9 59.2 8.6 12.7 19.5 16.8 85.6 17.9 46.1 24.3 11.7 72.8 60.7 25.1 14.2 4.2 7.9 36.0 33.7 12.9 13.7 36.4 5.0 3.9 3.3 21.6 18.3 82.3
22 60.3 0.0 18.7 31.0 50.3 13.0 84.0 13.8 59.0 20.4 6.7 72.8 74.6 18.3 7.1 5.2 3.8 28.4 28.9 9.6 1.3 43.4 0.6 3.9 -1.3 19.2 23.7 82.1
23 84.9 37.7 0.0 26.7 35.6 13.3 85.2 13.0 61.9 19.2 5.9 76.9 75.2 18.2 6.6 4.8 4.1 31.6 28.5 9.6 1.1 42.4 0.5 2.7 0.5 20.3 20.8 80.6
24 93.6 78.7 0.0 7.9 13.5 14.4 90.1 14.1 47.1 25.9 12.9 79.7 59.8 25.1 15.0 2.8 5.4 34.8 30.6 10.3 5.9 52.0 3.1 2.9 4.2 19.6 20.1 87.3
25 80.0 22.5 29.8 13.5 34.3 15.8 79.7 17.2 57.9 18.1 6.8 69.4 74.6 18.3 7.2 4.1 5.8 35.3 30.9 9.9 2.4 42.8 1.0 4.2 2.2 20.2 22.2 81.8
26 82.6 29.5 12.6 19.9 38.0 15.0 82.9 19.8 51.1 18.4 10.7 68.7 68.9 19.7 11.3 3.5 5.5 32.1 32.3 12.1 5.2 39.1 2.0 4.8 3.2 21.4 20.7 83.3
27 100.0 99.9 0.0 0.0 0.1 20.5 91.4 16.7 32.3 31.8 19.1 76.1 43.0 32.8 24.2 5.5 16.4 46.2 36.8 20.2 34.0 38.0 12.9 2.3 2.9 20.7 16.7 77.1
28 89.7 66.4 0.0 9.6 24.1 18.9 82.3 22.6 45.7 22.9 8.7 69.4 57.6 27.1 15.4 4.0 7.8 32.7 35.6 11.8 9.8 38.6 3.8 4.0 3.6 23.3 15.2 83.4
29 57.4 0.0 0.0 53.4 46.6 14.4 77.5 16.7 58.5 18.9 5.9 67.6 75.7 18.0 6.3 3.8 3.9 25.4 28.6 8.8 1.5 44.5 0.7 3.6 0.7 22.6 17.7 81.5
30 62.7 0.0 0.0 67.4 32.6 17.1 77.4 16.0 61.2 16.9 5.8 65.4 78.6 15.6 5.7 7.6 2.9 19.8 23.1 7.1 1.2 43.4 0.5 2.7 -2.4 22.2 17.5 77.4
31 0.0 0.0 0.0 81.4 18.6 12.7 80.6 16.0 64.0 17.6 2.4 70.8 80.7 13.9 5.4 2.8 2.9 18.8 20.2 7.8 1.8 48.0 0.9 3.6 0.7 22.9 14.2 83.9
32 4.6 0.0 0.0 23.6 76.4 17.0 81.3 16.0 60.2 19.6 4.3 65.8 76.4 17.2 6.4 9.7 4.0 22.7 25.0 9.0 1.1 34.2 0.4 2.9 -4.0 16.9 25.1 73.1
33 70.4 31.2 0.0 28.8 40.0 13.2 84.3 15.6 58.3 19.9 6.2 74.0 73.4 19.0 7.6 2.4 3.0 24.4 25.4 8.6 1.8 41.8 0.7 3.6 2.2 21.8 20.1 88.6
34 100.0 98.8 0.0 0.0 1.2 17.2 86.2 19.8 47.2 24.1 8.9 72.6 58.0 28.7 13.3 4.2 8.8 35.8 35.6 15.3 28.5 27.4 7.8 2.9 5.3 23.4 17.2 83.2
35 94.0 48.3 9.2 18.5 23.9 16.5 78.0 21.3 58.3 16.1 4.4 68.0 76.2 17.7 6.1 3.8 5.6 30.4 34.9 13.1 5.3 27.3 1.5 5.6 4.9 23.0 17.1 84.0
36 83.1 42.1 13.1 16.9 28.0 16.5 78.8 21.2 56.9 16.2 5.7 69.5 71.7 20.5 7.8 3.7 5.7 30.3 36.8 13.9 5.3 27.6 1.5 6.6 5.3 22.4 17.4 84.0
37 97.6 49.9 24.5 10.7 15.0 16.3 83.7 20.7 51.0 20.5 7.8 71.8 63.7 24.8 11.6 3.5 5.7 31.0 32.6 12.0 10.6 33.2 3.5 5.1 4.5 22.8 17.7 85.4
38 0.0 0.0 0.0 36.2 63.8 38.0 38.9 46.2 28.6 19.8 5.5 33.5 74.7 18.2 7.4 11.1 6.5 17.6 3.2 0.0 1.1 26.7 0.3 2.7 9.1 43.4 3.8 68.5
39 0.0 0.0 0.0 44.7 55.3 30.6 42.4 34.3 44.4 14.8 5.6 49.7 84.3 11.8 3.9 11.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.1 4.8 40.2 5.5 69.1
40 95.0 68.1 8.1 10.0 13.8 17.1 91.9 24.8 32.4 31.5 11.4 68.2 48.8 34.0 17.2 4.3 10.1 34.7 45.7 19.8 29.1 24.3 7.1 4.1 4.6 22.5 16.7 80.3
41 87.7 0.0 58.6 14.6 26.7 17.6 88.1 30.3 43.1 22.2 4.4 62.7 66.6 25.8 7.6 5.9 7.5 32.5 44.6 11.2 8.0 9.3 0.7 3.4 -1.5 19.1 23.7 74.6
42 99.5 0.0 72.8 11.5 15.7 18.5 88.4 35.9 40.6 19.0 4.4 59.2 69.3 22.3 8.4 3.4 7.1 30.3 45.4 14.8 12.4 14.1 1.8 5.0 -1.6 23.7 17.6 81.1
43 100.0 79.5 6.1 6.0 8.4 18.2 92.6 27.6 40.4 25.9 6.1 67.3 59.6 29.7 10.7 4.5 6.8 27.7 47.7 19.9 23.6 16.0 3.8 5.4 6.2 24.6 14.4 82.0
44 100.0 0.0 47.0 15.3 37.7 14.7 83.0 37.3 42.9 16.6 3.2 55.8 76.1 18.5 5.4 3.5 6.0 29.3 41.6 14.0 11.9 13.5 1.6 5.7 1.7 25.1 18.1 80.5
45 34.9 0.0 0.0 45.9 54.1 12.0 85.3 30.1 48.9 17.0 4.0 60.3 71.0 21.1 7.9 3.7 4.9 24.2 46.1 16.3 7.9 10.4 0.8 5.9 2.1 20.5 23.9 82.7
46 100.0 0.0 0.0 45.3 54.7 13.4 87.7 33.0 46.7 17.0 3.0 57.7 76.8 17.8 5.4 4.0 3.8 22.6 43.5 14.5 9.8 10.0 1.0 4.9 1.5 21.9 20.7 81.3
47 82.4 0.0 0.0 40.6 59.4 13.1 90.2 34.1 47.5 14.8 3.6 60.2 74.6 19.1 6.3 4.4 4.5 24.7 50.0 17.6 9.3 6.4 0.6 6.7 3.9 17.9 26.4 79.5
48 100.0 86.2 7.7 1.7 4.4 13.4 95.4 21.9 31.8 34.9 11.4 77.4 40.3 40.0 19.8 3.6 6.2 25.3 45.1 18.7 29.6 25.4 7.5 5.3 9.3 26.2 14.9 84.9
49 100.0 91.5 0.0 2.0 6.5 19.2 90.3 26.5 36.5 26.7 10.3 64.5 57.3 28.5 14.2 3.9 12.1 40.7 45.4 16.8 24.7 20.4 5.0 3.3 3.3 22.2 17.3 79.0
50 93.7 0.0 42.0 10.0 48.0 15.3 87.8 36.1 45.1 15.3 3.5 58.0 75.0 18.9 6.0 4.0 6.8 29.8 47.9 16.5 7.6 11.8 0.9 5.8 0.6 20.3 22.6 77.2
51 0.0 0.0 0.0 40.1 59.9 11.0 85.4 32.3 48.2 15.3 4.2 57.6 76.1 17.4 6.5 3.2 4.2 21.4 38.8 14.8 7.2 11.6 0.8 6.4 0.3 22.9 22.7 83.1
52 100.0 0.0 42.7 27.0 30.3 16.8 83.4 31.3 45.7 17.8 5.3 57.4 74.2 19.5 6.3 4.6 7.7 32.1 41.6 12.8 8.7 13.6 1.2 3.0 -2.0 22.4 21.0 78.0
53 97.8 60.8 0.0 4.5 34.7 15.7 92.9 25.7 36.6 24.6 13.0 68.2 57.8 25.3 16.9 4.3 6.4 32.5 48.4 14.8 10.8 17.8 1.9 5.4 0.9 20.2 20.2 79.8
54 87.7 0.0 57.0 15.2 27.8 15.6 89.2 27.8 48.5 17.9 5.9 66.5 73.8 19.3 6.9 5.0 6.4 27.6 45.1 10.7 9.9 11.1 1.1 4.4 0.3 21.0 21.5 78.4
55 81.5 0.0 0.0 40.2 59.8 14.1 87.4 33.1 47.3 15.6 4.0 63.8 72.2 20.8 7.0 4.4 4.7 25.2 48.6 15.9 6.8 10.9 0.7 6.0 2.6 20.0 22.7 81.0
56 99.9 84.2 0.0 6.2 9.6 17.8 90.9 24.7 35.7 27.3 12.2 68.2 54.5 28.6 16.8 4.0 9.7 35.4 45.7 14.3 20.3 24.5 5.0 4.0 3.7 22.4 16.8 79.2
57 100.0 56.7 13.9 16.5 12.8 17.7 91.4 30.7 39.8 22.9 6.6 63.2 63.8 25.4 10.9 4.3 7.4 33.2 48.1 16.6 16.6 15.2 2.5 5.4 3.8 20.6 21.4 79.7
58 59.4 0.0 40.7 13.1 46.2 15.6 87.7 33.1 44.0 18.6 4.5 63.0 70.9 21.4 7.7 6.2 6.7 32.1 48.1 17.1 5.4 15.7 0.8 5.5 -0.8 19.4 22.5 75.4
59 20.3 0.0 0.0 40.4 59.6 19.4 73.8 35.1 40.1 18.9 5.9 52.8 71.4 22.3 6.3 6.2 4.6 20.0 38.4 11.0 4.7 16.3 0.8 4.5 2.3 25.9 16.6 75.7
60 100.0 91.3 0.0 1.8 6.9 16.2 94.2 20.4 28.1 36.1 15.3 77.1 38.3 38.0 23.7 4.2 9.4 34.2 42.3 14.1 23.6 26.9 6.4 2.7 5.8 22.8 15.4 83.1
61 100.0 0.0 36.4 33.8 29.7 14.0 86.5 33.1 47.2 16.9 2.9 57.2 74.2 19.7 6.1 2.6 4.5 23.2 40.9 12.3 9.9 10.5 1.0 6.1 4.9 23.8 18.7 85.2
62 100.0 95.8 0.0 2.4 1.8 17.0 93.7 24.1 30.3 33.8 11.8 67.8 42.7 38.1 19.2 4.7 12.3 36.9 45.3 27.6 51.5 27.4 14.1 3.5 6.5 25.4 12.8 79.8
63 71.4 0.0 40.4 26.9 32.7 13.0 84.2 33.4 44.0 19.3 3.3 55.8 70.5 21.4 8.1 2.4 4.2 23.6 36.3 13.9 8.9 13.5 1.2 4.8 2.2 24.3 18.6 86.5
64 98.2 0.0 59.4 7.3 33.3 16.1 91.1 29.3 44.3 20.6 5.9 66.7 63.2 25.9 10.8 4.5 7.3 35.8 52.3 16.9 8.4 13.0 1.1 6.0 2.5 19.7 22.8 79.3
65 49.7 0.0 0.0 67.2 32.8 17.2 80.8 32.2 46.5 17.2 4.0 58.9 77.0 17.8 5.3 5.1 5.0 24.2 37.6 11.3 2.7 21.8 0.6 3.5 0.0 23.7 16.7 75.7
66 72.1 0.0 0.0 53.3 46.7 13.9 90.8 35.4 43.0 17.5 4.2 60.3 71.4 21.0 7.7 4.8 4.2 22.9 38.8 14.3 5.5 13.8 0.8 5.9 1.0 21.2 20.9 79.9
67 88.9 0.0 22.5 29.7 47.8 14.8 88.2 33.8 46.1 15.7 4.5 59.7 72.4 20.6 6.9 4.0 5.9 28.6 47.0 14.0 5.7 14.2 0.8 4.9 3.2 21.7 19.8 81.4
68 88.5 27.0 5.8 36.6 30.7 16.1 88.9 33.9 41.9 19.8 4.5 60.6 67.6 24.4 8.0 4.1 5.6 25.8 48.9 19.6 12.5 13.4 1.7 7.1 7.2 22.0 18.8 82.1
69 87.4 0.0 44.8 24.3 30.9 17.2 87.1 25.6 47.6 21.3 5.7 67.6 68.1 22.9 9.0 5.1 6.0 29.8 40.7 12.1 5.5 18.1 1.0 2.9 0.9 21.0 19.1 79.3
70 84.1 0.0 61.9 5.3 32.8 19.2 84.6 27.2 38.2 26.6 8.0 63.2 61.9 27.7 10.4 4.7 6.6 29.7 41.8 10.3 7.9 12.6 1.0 3.5 3.1 21.4 18.8 76.9
71 1.8 0.0 0.0 50.0 50.0 14.6 82.7 28.1 50.7 17.6 4.0 60.9 77.8 16.4 5.6 5.3 5.4 29.0 42.1 13.6 3.5 13.9 0.5 5.8 -0.9 20.9 21.8 76.8
72 100.0 87.8 0.0 6.4 5.8 16.0 89.5 26.9 32.2 29.3 11.5 65.4 52.4 31.3 16.3 3.5 6.5 32.0 42.3 14.6 22.6 24.1 5.5 4.9 5.5 24.1 14.4 82.7
73 89.1 46.5 10.8 18.0 24.6 14.1 89.5 29.3 34.4 26.5 9.8 63.9 54.5 30.4 15.1 3.0 4.9 27.2 41.4 17.2 16.4 18.9 3.1 6.0 7.2 24.3 15.8 85.8
74 99.9 72.0 8.3 7.3 12.4 18.4 90.4 29.3 34.7 25.6 10.5 62.3 57.7 28.3 14.0 4.0 9.3 33.6 44.0 12.7 21.9 23.3 5.1 3.5 2.6 23.1 17.7 77.2
75 100.0 88.0 2.9 5.1 4.0 13.8 94.6 21.3 27.7 38.5 12.5 73.3 39.0 41.7 19.4 3.9 11.5 36.4 51.8 28.1 46.8 21.0 9.8 3.9 7.5 24.4 14.6 82.2
76 100.0 100.0 0.0 0.0 0.0 21.2 94.0 19.9 23.5 40.6 16.0 72.1 34.5 41.7 23.8 4.8 17.8 42.3 46.8 27.4 47.0 28.1 13.2 2.5 4.5 19.9 15.6 77.7
77 77.9 55.7 3.8 13.7 26.8 17.0 87.1 36.9 29.9 26.9 6.4 59.8 53.9 33.8 12.2 3.7 9.2 30.8 36.9 11.4 18.3 49.1 9.0 3.4 5.8 25.5 15.6 80.3
78 99.9 98.6 0.0 0.0 1.4 18.3 92.2 32.7 27.3 31.8 8.1 65.7 46.5 38.7 14.8 3.4 11.9 36.5 39.5 12.2 25.2 49.8 12.6 1.9 6.3 23.0 15.7 82.1
79 36.3 0.0 29.2 20.7 50.1 14.7 85.8 41.3 35.2 19.7 3.8 54.4 67.5 24.8 7.8 3.7 6.5 22.0 29.6 9.8 8.9 63.5 5.7 5.0 2.6 25.2 19.1 80.7
80 50.3 0.8 0.0 24.1 75.1 14.3 80.9 43.6 37.4 16.0 3.1 53.8 71.0 22.7 6.3 4.2 4.0 16.5 32.1 11.2 5.8 20.5 1.2 5.4 3.6 24.3 18.3 78.4
81 22.5 0.0 0.0 39.1 60.9 31.9 59.0 55.4 26.7 15.2 2.6 39.3 69.4 23.6 7.0 8.7 7.4 18.2 22.2 6.1 0.0 0.0 0.0 3.7 3.0 38.6 7.3 60.9
82 70.4 0.0 0.0 42.5 57.5 11.0 81.5 44.9 35.2 17.2 2.7 50.3 67.2 25.2 7.6 2.9 5.1 19.2 34.1 10.9 14.0 43.3 6.1 6.1 9.6 31.2 13.4 81.2
83 68.4 41.9 7.2 17.7 33.2 16.4 88.4 36.0 32.7 25.2 6.0 60.4 58.5 30.6 11.0 3.8 6.1 21.9 42.1 12.1 10.5 64.3 6.8 4.7 6.3 25.8 15.5 80.7
84 51.6 0.0 0.0 45.5 54.5 11.1 90.4 41.3 37.7 19.2 1.8 52.9 71.5 23.4 5.2 2.9 3.5 12.8 32.1 9.8 7.4 68.6 5.1 4.5 1.6 25.6 16.8 83.0
85 71.4 0.0 61.0 4.5 34.5 13.6 88.8 38.6 35.0 23.2 3.6 57.1 68.4 24.2 7.4 2.8 5.6 22.0 39.5 10.2 6.7 54.7 3.7 5.7 2.0 23.2 20.1 84.5
86 48.1 0.0 0.0 46.1 53.9 9.9 89.8 39.3 34.5 21.9 3.8 56.2 69.3 25.1 5.6 2.0 4.2 17.4 32.2 9.7 7.0 60.0 4.2 5.3 1.5 23.7 19.5 87.1
87 88.5 77.1 0.0 6.3 16.6 16.9 91.6 34.4 28.8 29.6 7.2 63.7 51.6 35.9 12.5 3.2 6.3 22.0 45.6 11.7 14.0 64.0 8.9 4.0 9.4 24.8 14.4 83.3
88 42.2 0.0 0.0 49.0 51.0 13.2 87.6 41.9 37.7 17.2 3.2 55.7 70.5 23.7 5.9 3.4 4.9 19.5 41.9 11.0 6.0 65.1 3.9 6.4 0.5 23.1 22.6 82.8
89 87.1 71.2 0.0 11.1 17.8 14.8 92.6 32.1 31.0 29.0 7.8 66.0 51.9 33.8 14.3 3.5 6.3 24.9 44.2 13.9 14.1 63.7 9.0 4.5 11.0 24.9 13.9 82.4
90 1.9 0.0 0.0 33.9 66.1 9.1 88.6 33.2 45.3 18.4 3.0 58.0 74.1 20.9 5.0 2.1 3.2 14.4 30.2 9.9 5.1 62.5 3.2 5.1 0.2 24.4 19.7 85.1
91 0.0 0.0 0.0 35.9 64.1 13.2 82.4 45.8 37.7 13.9 2.6 53.5 72.2 21.9 5.9 4.3 5.6 18.0 41.6 11.6 4.6 71.3 3.3 4.9 -3.4 25.3 21.5 77.5
92 67.7 0.0 46.4 3.1 50.5 21.9 78.3 43.1 36.2 16.2 4.5 54.4 69.3 23.4 7.3 5.9 8.5 24.8 42.8 11.9 5.1 60.8 3.1 5.4 0.3 28.1 16.6 73.2
93 55.9 0.0 15.6 30.0 54.3 22.0 81.7 39.5 40.9 15.8 3.8 54.2 71.0 21.9 7.1 6.2 6.6 19.7 38.6 12.8 7.1 77.9 5.5 6.1 6.4 30.6 13.0 73.9
94 0.0 0.0 0.0 46.4 53.6 39.2 52.5 59.0 29.2 10.7 1.1 35.4 70.6 23.3 6.3 14.4 13.4 21.7 29.8 9.2 0.0 0.0 0.0 3.6 3.0 40.7 6.4 50.3
95 0.0 0.0 0.0 0.0 100.0 51.1 58.9 53.1 37.2 8.8 1.8 36.9 79.9 15.9 4.2 14.8 14.2 33.6 29.1 11.8 0.5 40.0 0.2 3.4 -2.1 38.8 7.7 45.4
96 0.0 0.0 0.0 0.0 100.2 32.8 35.9 64.3 28.6 0.0 0.0 24.5 75.5 22.4 4.1 18.9 0.0 16.7 14.3 0.0 0.0 0.0 0.0 4.3 14.6 41.6 5.1 47.0
97 0.0 0.0 0.0 29.1 70.8 42.4 53.7 56.5 32.4 9.8 1.3 35.0 73.8 20.7 5.5 14.9 13.7 26.4 28.8 10.1 0.0 0.0 0.0 3.6 2.2 40.2 6.7 48.5
98 86.3 56.6 12.2 14.8 16.4 14.5 88.9 30.8 33.9 28.2 7.1 65.7 54.0 32.8 13.2 4.7 6.4 23.8 36.0 15.1 21.2 42.0 8.9 4.2 11.6 25.1 12.3 79.9
99 77.5 0.0 51.7 22.5 25.8 14.3 82.8 36.7 38.7 20.6 4.0 56.4 66.5 25.4 8.1 3.8 5.5 22.4 36.4 13.0 12.7 40.4 5.1 5.1 6.3 26.7 15.7 78.7
100 97.1 79.8 3.9 9.8 6.5 13.6 91.6 27.0 29.5 34.7 8.7 70.7 44.7 38.9 16.4 5.2 7.0 26.0 36.6 16.4 27.6 40.9 11.3 3.3 14.0 24.6 11.4 80.7
101 59.0 0.0 25.9 34.3 39.8 14.0 84.1 39.1 42.6 15.3 3.0 57.3 72.7 21.4 5.9 4.6 4.9 16.5 35.7 14.2 8.7 45.0 3.9 6.0 6.5 25.6 15.1 78.0
102 99.8 80.5 2.7 8.5 8.3 15.5 90.4 29.4 33.7 28.9 8.0 67.4 54.0 32.3 13.7 4.3 7.2 26.8 37.8 15.1 23.8 42.1 10.0 4.2 13.9 24.1 12.3 80.5
103 42.1 0.0 29.2 26.1 44.8 15.0 82.4 40.0 40.5 16.3 3.2 55.5 71.9 21.8 6.3 4.4 4.3 13.7 26.9 12.7 9.5 49.9 4.7 5.1 5.9 27.8 10.2 77.5
104 91.3 61.8 12.9 11.7 13.6 15.1 92.1 30.4 31.1 30.1 8.4 66.4 50.6 33.2 16.2 3.6 10.2 35.0 43.3 20.7 28.3 27.0 7.7 5.5 5.6 20.4 18.3 80.2
105 34.9 0.0 0.0 56.6 43.4 12.1 89.5 36.0 40.8 19.6 3.4 62.2 69.2 22.4 8.5 3.8 5.0 23.1 39.1 12.2 9.9 20.8 2.1 6.1 5.5 20.6 21.0 81.2
106 23.3 0.0 0.0 53.2 46.8 14.3 91.2 35.4 44.6 15.8 4.2 64.2 68.0 21.7 10.4 4.3 7.1 27.4 42.9 15.9 11.0 16.3 1.8 7.2 1.4 19.0 23.1 78.6
107 89.9 41.9 22.6 13.0 22.5 13.9 89.3 37.4 38.9 19.6 4.0 62.7 67.1 22.7 10.2 3.9 6.2 31.1 46.3 17.3 13.3 18.6 2.5 6.8 6.0 18.8 24.3 80.6
108 74.4 0.0 35.6 24.9 39.6 14.7 88.4 40.7 36.1 19.4 3.8 58.6 67.4 23.4 9.2 4.5 6.1 28.9 42.4 13.9 9.6 18.6 1.8 6.2 2.3 19.9 21.6 77.6
109 97.2 41.0 36.0 5.2 17.8 15.4 88.7 42.5 35.0 17.7 4.8 54.2 65.9 23.3 10.9 3.9 7.7 33.7 41.5 18.9 20.4 25.2 5.1 6.4 6.7 24.2 18.0 79.0
110 100.0 98.1 0.0 0.0 1.9 15.7 94.4 26.6 31.5 33.9 8.1 71.3 46.4 36.8 16.8 3.2 13.6 40.8 43.3 24.7 39.6 28.2 11.2 6.4 6.2 20.9 14.7 81.0
111 100.0 77.7 11.9 7.1 3.2 14.5 91.7 34.2 33.6 25.3 7.0 61.7 53.6 31.5 14.9 3.3 11.2 36.0 38.6 22.4 36.6 31.0 11.4 5.0 9.3 23.9 15.7 80.7
112 100.0 99.8 0.0 0.0 0.2 16.7 96.3 22.4 25.6 41.8 10.2 72.1 37.3 45.0 17.8 3.5 20.8 45.5 47.1 32.0 60.2 27.7 16.7 3.4 4.1 19.6 17.0 76.8
113 100.0 99.9 0.0 0.0 0.1 16.1 96.8 18.7 23.0 42.9 15.4 76.3 33.1 42.9 24.0 3.0 15.0 41.8 44.7 28.2 42.6 27.4 11.7 3.9 5.0 15.9 15.3 81.3
114 81.1 63.7 0.0 22.5 13.8 14.2 93.9 29.3 30.3 32.0 8.5 74.9 42.3 37.2 20.5 3.0 9.5 31.0 46.5 23.4 29.7 30.0 8.9 5.1 4.8 20.8 19.7 81.3
115 95.9 87.6 0.0 2.4 10.0 14.5 91.4 32.3 29.4 30.7 7.6 69.1 48.8 33.7 17.5 3.0 6.5 32.1 44.5 19.1 18.5 21.5 4.0 5.6 6.5 18.0 21.7 83.2
116 95.6 0.0 34.0 38.0 28.0 15.0 85.0 41.0 36.9 18.3 3.8 60.4 66.7 22.9 10.4 4.4 7.4 30.6 46.9 16.2 14.0 14.6 2.0 6.8 4.8 18.7 25.5 77.3
117 85.4 0.0 65.6 8.7 25.7 14.9 85.3 41.4 38.5 16.6 3.4 59.3 68.1 22.0 9.9 4.5 7.6 28.9 43.3 14.5 11.5 14.7 1.7 7.1 3.3 19.8 23.4 78.0
118 44.1 0.0 0.0 56.5 43.5 18.2 81.4 41.7 39.0 16.1 3.3 54.3 69.6 21.6 8.9 7.0 6.1 20.8 31.7 9.5 10.2 16.7 1.7 5.8 -1.2 24.7 14.8 74.6
119 78.2 0.0 46.6 14.8 38.6 16.9 86.0 42.3 34.0 19.9 3.9 54.6 67.5 22.7 9.8 4.7 6.4 27.2 36.6 9.6 9.3 20.7 1.9 4.6 1.0 23.6 15.2 77.6
120 69.7 0.0 0.0 54.2 45.8 15.1 82.4 40.8 38.6 17.5 3.3 51.7 74.0 20.0 6.0 5.2 4.9 16.0 30.6 11.3 8.3 44.3 3.7 5.6 3.7 27.5 10.0 76.2
121 78.7 0.0 0.0 60.6 39.4 19.0 88.8 32.4 36.6 26.5 4.7 69.3 53.7 31.1 15.2 5.6 0.0 0.0 29.4 12.7 12.6 41.7 5.2 5.6 5.8 22.9 11.9 83.3
122 78.7 0.0 0.0 60.6 39.4 19.0 88.8 32.1 36.4 26.5 4.7 69.3 53.7 31.1 15.2 5.6 0.0 0.0 29.4 12.7 12.6 41.7 5.2 5.6 5.8 22.9 11.9 83.3
123 46.8 0.0 0.0 64.1 35.9 21.7 76.2 37.4 32.0 26.2 4.4 60.0 57.5 30.3 12.3 6.6 0.0 0.0 16.3 9.0 9.0 39.1 3.5 6.3 0.8 27.7 7.7 78.5
124 46.8 0.0 0.0 64.1 35.9 21.7 76.2 37.4 32.0 26.4 4.6 60.0 57.5 30.3 12.3 6.6 0.0 0.0 16.2 9.1 9.0 39.1 3.5 6.3 0.8 27.7 7.7 78.6
125 0.0 0.0 0.0 48.9 51.1 29.0 46.7 46.8 33.4 15.4 4.4 43.0 68.0 21.4 10.6 16.7 0.0 0.0 5.3 7.9 2.6 37.0 1.0 5.1 12.7 41.2 3.8 62.1
126 0.0 0.0 0.0 49.0 51.0 29.1 46.7 46.8 33.4 15.7 4.1 43.1 68.0 21.3 10.7 16.7 0.0 0.0 5.2 8.2 2.6 37.0 1.0 5.1 12.7 41.2 3.8 62.0
In [154]:
df_rate.shape[1]
Out[154]:
28

Out of 36 features in rate category, 8 are highly correlated ( >=0.9), therefore we can reduce the number of features in this category to 28. Then in total we have reduced the number of features to 39:

In [155]:
df39 = pd.concat([df_nrate, df_rate], axis=1)
In [156]:
df39.head()
Out[156]:
Geocode GNR POP_2016 POP_DENSE GNR.1 AVE_PERS_INC AVE_PERS_INC_FEMALE AVE_DWELL MEDIAN_HHLD_INC INC_SHR_BELOW_MEDIAN GOVT_TRNSFR_INC POP_MIZ_RATE LRG_POP_CNTR_RATE MED_POP_CNTR_RATE SML_POP_CNTRE_RATE RRL_RATE LONE_RATE HSG_RATE HSG_ONLY_RATE HSG_PSG_BELOW_BACH_RATE HSG_PSG_BACH_RATE HSG_PSG_ABOVE_BACH_RATE PSG_RATE PSG_BELOW_BACH_RATE PSG_BACH_RATE PSG_ABOVE_BACH_RATE UE_RATE LW_INC_ECON_FAM_RATE LOW_INC_UNATTACHED_RATE RENTER_OVER30_RATE OWNER_OVER30_RATE IMM_RATE IMM_2006_2016_RATE_IMM IMM_2006_2016_RATE_TOTPOP ONE_YR_INT_MIG_RATE GROWTH_RATE POP_0_19_RATE POP_65_RATE EMP_RATE_25_54
0 1 4.0 35151730 3.9 5.1 47487 38632 443058 70332 20.5 11.7 88.8 59.6 9.0 12.7 18.7 16.4 89.7 26.0 35.5 28.6 9.8 67.9 53.8 31.0 15.2 4.4 8.8 32.9 40.0 16.6 21.9 30.1 6.6 4.3 5.0 22.4 16.9 80.4
1 10 4.0 519715 1.4 6.8 45210 34259 243157 67320 24.2 16.5 60.1 34.3 0.0 23.7 41.9 15.1 90.5 25.0 45.2 23.4 6.3 66.2 68.0 21.4 10.6 9.5 5.9 27.6 39.4 10.6 2.4 45.9 1.1 4.4 1.0 19.6 19.4 72.0
2 1011 3.7 313270 16.2 6.5 48517 36888 285746 72211 20.3 13.8 75.3 57.0 0.0 10.4 32.7 15.9 92.8 23.5 42.0 26.9 7.6 70.3 63.3 24.1 12.5 7.0 6.1 29.6 40.8 12.0 3.0 47.3 1.4 4.3 2.2 19.9 17.9 75.9
3 1012 4.1 92690 2.2 6.8 37795 27521 162012 56442 21.9 24.2 46.7 0.0 0.0 34.4 65.6 11.7 84.4 31.3 52.7 12.8 2.9 55.4 79.4 13.9 6.7 14.8 5.7 24.6 39.9 8.2 0.9 34.5 0.3 4.4 -1.0 18.0 23.8 63.4
4 1013 4.7 77685 2.5 8.1 38028 29950 195672 55950 21.5 22.9 43.1 0.0 0.0 53.8 46.2 16.5 89.9 28.4 51.0 16.6 4.0 62.3 74.4 18.1 7.5 12.6 6.8 25.5 41.9 10.2 1.4 34.6 0.5 5.0 -0.4 18.0 23.0 65.7
In [157]:
df39.describe()
Out[157]:
Geocode GNR POP_2016 POP_DENSE GNR.1 AVE_PERS_INC AVE_PERS_INC_FEMALE AVE_DWELL MEDIAN_HHLD_INC INC_SHR_BELOW_MEDIAN GOVT_TRNSFR_INC POP_MIZ_RATE LRG_POP_CNTR_RATE MED_POP_CNTR_RATE SML_POP_CNTRE_RATE RRL_RATE LONE_RATE HSG_RATE HSG_ONLY_RATE HSG_PSG_BELOW_BACH_RATE HSG_PSG_BACH_RATE HSG_PSG_ABOVE_BACH_RATE PSG_RATE PSG_BELOW_BACH_RATE PSG_BACH_RATE PSG_ABOVE_BACH_RATE UE_RATE LW_INC_ECON_FAM_RATE LOW_INC_UNATTACHED_RATE RENTER_OVER30_RATE OWNER_OVER30_RATE IMM_RATE IMM_2006_2016_RATE_IMM IMM_2006_2016_RATE_TOTPOP ONE_YR_INT_MIG_RATE GROWTH_RATE POP_0_19_RATE POP_65_RATE EMP_RATE_25_54
count 127.000000 127.000000 1.270000e+02 127.00000 127.000000 127.000000 127.000000 1.270000e+02 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000
mean 3283.417323 4.827559 8.306347e+05 196.35748 6.077953 44704.582677 36830.984252 3.337558e+05 70044.692913 22.247244 14.324409 68.637008 26.453543 11.937008 25.014961 36.598425 17.221260 84.861417 31.836220 40.233858 21.855118 6.018898 61.746457 64.455906 25.166929 10.404724 5.577953 6.570866 26.101575 37.062205 13.206299 11.667717 31.288976 3.527559 4.681102 3.181890 23.559843 17.446457 77.916535
std 1824.155993 1.512890 3.413698e+06 710.76828 1.824564 7154.292111 5636.364186 2.001311e+05 13465.638646 1.905665 4.502657 32.161415 35.747998 18.712265 19.268862 21.542774 6.033447 11.377366 9.523177 8.934144 6.998863 3.355184 9.364012 11.158075 6.825432 4.686215 3.322265 3.427714 9.061536 9.826034 5.109309 11.537963 17.117276 3.580340 1.231191 3.915768 5.500408 5.002139 7.439532
min 1.000000 2.400000 2.630000e+03 0.00000 3.000000 28882.000000 24812.000000 0.000000e+00 47680.000000 16.400000 5.500000 0.000000 0.000000 0.000000 0.000000 0.000000 9.100000 35.900000 13.000000 23.000000 0.000000 0.000000 24.500000 33.100000 11.800000 3.900000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.900000 -6.200000 15.900000 3.800000 45.400000
25% 2402.500000 3.900000 7.662500e+04 2.10000 4.750000 40592.500000 33632.500000 2.091425e+05 61559.500000 21.450000 11.100000 50.000000 0.000000 0.000000 9.150000 18.650000 14.350000 82.850000 25.650000 33.650000 17.000000 3.800000 57.150000 57.550000 20.650000 6.550000 3.700000 4.900000 22.000000 32.450000 10.250000 4.600000 17.250000 0.900000 3.600000 0.550000 20.250000 15.150000 77.150000
50% 3544.000000 4.500000 1.705350e+05 10.00000 5.700000 43553.000000 35650.000000 2.929320e+05 67548.000000 22.200000 14.300000 78.700000 0.000000 0.000000 22.500000 37.700000 15.800000 88.400000 32.200000 38.700000 19.800000 4.800000 62.400000 68.000000 23.400000 9.000000 4.400000 6.200000 27.500000 38.800000 12.100000 8.400000 27.700000 2.000000 4.800000 2.900000 22.400000 17.900000 79.500000
75% 4706.500000 5.250000 4.670850e+05 38.60000 6.800000 47196.500000 38545.500000 3.679680e+05 74352.500000 23.200000 16.550000 95.300000 56.850000 17.250000 40.300000 51.900000 17.650000 91.200000 36.800000 45.700000 26.500000 7.650000 67.850000 72.750000 29.450000 13.250000 5.850000 7.450000 31.800000 43.450000 15.100000 14.000000 42.250000 5.100000 5.600000 5.550000 24.600000 21.000000 82.100000
max 6201.000000 12.800000 3.515173e+07 4942.60000 13.400000 68678.000000 58918.000000 1.410609e+06 117723.000000 30.500000 28.300000 100.000000 100.000000 72.800000 81.400000 100.200000 51.100000 96.800000 64.300000 64.000000 42.900000 19.100000 79.700000 84.300000 45.000000 24.200000 18.900000 20.800000 46.200000 52.300000 32.000000 60.200000 77.900000 16.700000 7.200000 14.600000 43.400000 26.400000 88.600000
In [158]:
to_multiply = [col for col in df39 if max(df39[col]) > 100]
to_multiply
Out[158]:
['Geocode',
 'POP_2016',
 'POP_DENSE',
 'AVE_PERS_INC',
 'AVE_PERS_INC_FEMALE',
 'AVE_DWELL',
 'MEDIAN_HHLD_INC',
 'RRL_RATE']

Looking at the statistics description of selected features, all features values are almost in the range (0,100) except for POP_2016, POP_DENSE, AVE_PERS_INC, AVE_PERS_INC_FEMALE, AVE_DWELL, MEDIAN_HHLD_INC. We can transform those columns values to fit in the range (0,100). We also add the geographical name for future analysis:

In [159]:
df39['POP_2016_in_million'] = df39['POP_2016'].values/1000000
df39['POP_DENSE_in_100'] = df39['POP_DENSE'].values/100
df39['AVE_PERS_INC_in_1000'] = df39['AVE_PERS_INC'].values/1000
df39['AVE_PERS_INC_FEMALE_in_1000'] = df39['AVE_PERS_INC_FEMALE'].values/1000
df39['MEDIAN_HHLD_INC_in_1000'] = df39['MEDIAN_HHLD_INC'].values/1000
df39['AVE_DWELL_in_100000'] = df39['AVE_DWELL'].values/100000
df39 = df39.drop(['POP_2016','POP_DENSE','AVE_PERS_INC','AVE_PERS_INC_FEMALE','AVE_DWELL','MEDIAN_HHLD_INC'], axis = 1)
df39 = pd.concat([df39, rawdf['Geo Name']], axis=1)
df39.head()
Out[159]:
Geocode GNR GNR.1 INC_SHR_BELOW_MEDIAN GOVT_TRNSFR_INC POP_MIZ_RATE LRG_POP_CNTR_RATE MED_POP_CNTR_RATE SML_POP_CNTRE_RATE RRL_RATE LONE_RATE HSG_RATE HSG_ONLY_RATE HSG_PSG_BELOW_BACH_RATE HSG_PSG_BACH_RATE HSG_PSG_ABOVE_BACH_RATE PSG_RATE PSG_BELOW_BACH_RATE PSG_BACH_RATE PSG_ABOVE_BACH_RATE UE_RATE LW_INC_ECON_FAM_RATE LOW_INC_UNATTACHED_RATE RENTER_OVER30_RATE OWNER_OVER30_RATE IMM_RATE IMM_2006_2016_RATE_IMM IMM_2006_2016_RATE_TOTPOP ONE_YR_INT_MIG_RATE GROWTH_RATE POP_0_19_RATE POP_65_RATE EMP_RATE_25_54 POP_2016_in_million POP_DENSE_in_100 AVE_PERS_INC_in_1000 AVE_PERS_INC_FEMALE_in_1000 MEDIAN_HHLD_INC_in_1000 AVE_DWELL_in_100000 Geo Name
0 1 4.0 5.1 20.5 11.7 88.8 59.6 9.0 12.7 18.7 16.4 89.7 26.0 35.5 28.6 9.8 67.9 53.8 31.0 15.2 4.4 8.8 32.9 40.0 16.6 21.9 30.1 6.6 4.3 5.0 22.4 16.9 80.4 35.151730 0.039 47.487 38.632 70.332 4.43058 Canada
1 10 4.0 6.8 24.2 16.5 60.1 34.3 0.0 23.7 41.9 15.1 90.5 25.0 45.2 23.4 6.3 66.2 68.0 21.4 10.6 9.5 5.9 27.6 39.4 10.6 2.4 45.9 1.1 4.4 1.0 19.6 19.4 72.0 0.519715 0.014 45.210 34.259 67.320 2.43157 Newfoundland and Labrador / Terre-Neuve-et-Lab...
2 1011 3.7 6.5 20.3 13.8 75.3 57.0 0.0 10.4 32.7 15.9 92.8 23.5 42.0 26.9 7.6 70.3 63.3 24.1 12.5 7.0 6.1 29.6 40.8 12.0 3.0 47.3 1.4 4.3 2.2 19.9 17.9 75.9 0.313270 0.162 48.517 36.888 72.211 2.85746 1011 Eastern Regional Health Authority
3 1012 4.1 6.8 21.9 24.2 46.7 0.0 0.0 34.4 65.6 11.7 84.4 31.3 52.7 12.8 2.9 55.4 79.4 13.9 6.7 14.8 5.7 24.6 39.9 8.2 0.9 34.5 0.3 4.4 -1.0 18.0 23.8 63.4 0.092690 0.022 37.795 27.521 56.442 1.62012 1012 Central Regional Health Authority
4 1013 4.7 8.1 21.5 22.9 43.1 0.0 0.0 53.8 46.2 16.5 89.9 28.4 51.0 16.6 4.0 62.3 74.4 18.1 7.5 12.6 6.8 25.5 41.9 10.2 1.4 34.6 0.5 5.0 -0.4 18.0 23.0 65.7 0.077685 0.025 38.028 29.950 55.950 1.95672 1013 Western Regional Health Authority
In [160]:
corr39 = df39.corr()
plt.figure(figsize=(27,27))
sns.heatmap(corr39, 
            xticklabels=corr39.columns.values,
            yticklabels=corr39.columns.values, annot=True, cmap='cubehelix', square=True)
plt.title('Correlation between 39 reduced features')
Out[160]:
Text(0.5, 1, 'Correlation between 39 reduced features')

To further analysis of the data we work with 39 selected features. We also pick the data for Canada, provinces and territories and store in df14 and do some analysis for Canada and provinces. We also set the index of df to be the Geocode.

In [161]:
df39.set_index("Geocode", inplace=True)
In [162]:
Geocode_14 = [1, 10, 11, 12, 13, 24, 35, 46, 47, 48, 59, 60, 61, 62]
Geoname_14 = ['Canada', 'Newfoundland and Labrador', 'Prince Edward Island', 'Nova Scotia', 'New Brunswick', 'Quebec', 'Ontario', 'Manitoba', 'Saskatchewan', 'Alberta', 'British Columbia', 'Yukon', 'Northwest Territories', 'Nunavut']
Geo14 = pd.DataFrame({'Geocode': Geocode_14, 'Geoname': Geoname_14})
In [163]:
df14 = df39.loc[Geocode_14]
df14
Out[163]:
GNR GNR.1 INC_SHR_BELOW_MEDIAN GOVT_TRNSFR_INC POP_MIZ_RATE LRG_POP_CNTR_RATE MED_POP_CNTR_RATE SML_POP_CNTRE_RATE RRL_RATE LONE_RATE HSG_RATE HSG_ONLY_RATE HSG_PSG_BELOW_BACH_RATE HSG_PSG_BACH_RATE HSG_PSG_ABOVE_BACH_RATE PSG_RATE PSG_BELOW_BACH_RATE PSG_BACH_RATE PSG_ABOVE_BACH_RATE UE_RATE LW_INC_ECON_FAM_RATE LOW_INC_UNATTACHED_RATE RENTER_OVER30_RATE OWNER_OVER30_RATE IMM_RATE IMM_2006_2016_RATE_IMM IMM_2006_2016_RATE_TOTPOP ONE_YR_INT_MIG_RATE GROWTH_RATE POP_0_19_RATE POP_65_RATE EMP_RATE_25_54 POP_2016_in_million POP_DENSE_in_100 AVE_PERS_INC_in_1000 AVE_PERS_INC_FEMALE_in_1000 MEDIAN_HHLD_INC_in_1000 AVE_DWELL_in_100000 Geo Name
Geocode
1 4.0 5.1 20.5 11.7 88.8 59.6 9.0 12.7 18.7 16.4 89.7 26.0 35.5 28.6 9.8 67.9 53.8 31.0 15.2 4.4 8.8 32.9 40.0 16.6 21.9 30.1 6.6 4.3 5.0 22.4 16.9 80.4 35.151730 0.039 47.487 38.632 70.332 4.43058 Canada
10 4.0 6.8 24.2 16.5 60.1 34.3 0.0 23.7 41.9 15.1 90.5 25.0 45.2 23.4 6.3 66.2 68.0 21.4 10.6 9.5 5.9 27.6 39.4 10.6 2.4 45.9 1.1 4.4 1.0 19.6 19.4 72.0 0.519715 0.014 45.210 34.259 67.320 2.43157 Newfoundland and Labrador / Terre-Neuve-et-Lab...
11 4.1 5.6 23.2 17.6 77.8 0.0 31.3 13.8 54.9 15.7 93.4 29.9 34.9 28.4 6.9 65.9 59.5 28.5 12.1 7.6 6.0 27.6 36.3 11.1 6.4 52.5 3.4 5.8 1.9 21.9 19.4 79.5 0.142910 0.251 38.899 34.171 61.026 1.97966 Prince Edward Island / Île-du-Prince-Édouard
12 3.9 5.5 30.5 15.3 73.6 34.3 0.0 23.1 42.6 17.3 92.1 28.5 35.7 28.1 7.7 67.2 57.8 29.2 13.0 5.6 7.0 28.9 42.8 12.0 6.1 35.7 2.2 3.7 0.2 20.0 19.9 78.5 0.923600 0.174 41.479 34.132 60.745 2.30441 Nova Scotia / Nouvelle-Écosse
13 3.9 4.9 26.3 16.9 72.6 14.5 15.8 18.7 51.0 16.2 91.9 32.6 40.0 22.0 5.4 61.8 62.9 26.9 10.2 6.4 6.9 27.6 36.5 10.3 4.6 41.8 1.9 4.5 -0.5 20.2 19.9 77.9 0.747100 0.105 39.141 32.504 59.313 1.70071 New Brunswick / Nouveau-Brunswick
24 3.7 4.3 27.6 15.6 88.9 59.2 8.6 12.7 19.5 16.8 85.6 17.9 46.1 24.3 11.7 72.8 60.7 25.1 14.2 4.2 7.9 36.0 33.7 12.9 13.7 36.4 5.0 3.9 3.3 21.6 18.3 82.3 8.164360 0.060 42.546 36.203 59.808 2.90484 Quebec / Québec
35 3.7 4.6 22.4 11.1 95.0 68.1 8.1 10.0 13.8 17.1 91.9 24.8 32.4 31.5 11.4 68.2 48.8 34.0 17.2 4.3 10.1 34.7 45.7 19.8 29.1 24.3 7.1 4.1 4.6 22.5 16.7 80.3 13.448490 0.148 47.915 39.585 74.260 5.06409 Ontario
46 4.3 5.3 25.7 11.9 77.9 55.7 3.8 13.7 26.8 17.0 87.1 36.9 29.9 26.9 6.4 59.8 53.9 33.8 12.2 3.7 9.2 30.8 36.9 11.4 18.3 49.1 9.0 3.4 5.8 25.5 15.6 80.3 1.278365 0.023 43.767 36.747 68.070 2.97444 Manitoba
47 4.8 6.3 21.6 10.5 68.4 41.9 7.2 17.7 33.2 16.4 88.4 36.0 32.7 25.2 6.0 60.4 58.5 30.6 11.0 3.8 6.1 21.9 42.1 12.1 10.5 64.3 6.8 4.7 6.3 25.8 15.5 80.7 1.098350 0.019 49.409 40.184 75.388 3.18917 Saskatchewan
48 4.7 6.1 20.6 6.7 86.3 56.6 12.2 14.8 16.4 14.5 88.9 30.8 33.9 28.2 7.1 65.7 54.0 32.8 13.2 4.7 6.4 23.8 36.0 15.1 21.2 42.0 8.9 4.2 11.6 25.1 12.3 79.9 4.067175 0.064 62.778 45.535 93.931 4.49790 Alberta
59 4.9 6.1 20.5 11.1 91.3 61.8 12.9 11.7 13.6 15.1 92.1 30.4 31.1 30.1 8.4 66.4 50.6 33.2 16.2 3.6 10.2 35.0 43.3 20.7 28.3 27.0 7.7 5.5 5.6 20.4 18.3 80.2 4.648055 0.050 45.616 36.901 69.979 7.20689 British Columbia / Colombie-Britannique
60 6.7 6.8 22.6 8.9 78.7 0.0 0.0 60.6 39.4 19.0 88.8 32.4 36.6 26.5 4.7 69.3 53.7 31.1 15.2 5.6 0.0 0.0 29.4 12.7 12.6 41.7 5.2 5.6 5.8 22.9 11.9 83.3 0.035875 0.001 53.809 50.023 84.631 3.71716 Yukon
61 7.5 8.8 22.1 7.4 46.8 0.0 0.0 64.1 35.9 21.7 76.2 37.4 32.0 26.2 4.4 60.0 57.5 30.3 12.3 6.6 0.0 0.0 16.3 9.0 9.0 39.1 3.5 6.3 0.8 27.7 7.7 78.5 0.041785 0.000 64.586 58.918 117.723 3.46427 Northwest Territories / Territoires du Nord-Ou...
62 7.9 8.7 20.7 11.9 0.0 0.0 0.0 48.9 51.1 29.0 46.7 46.8 33.4 15.4 4.4 43.0 68.0 21.4 10.6 16.7 0.0 0.0 5.3 7.9 2.6 37.0 1.0 5.1 12.7 41.2 3.8 62.1 0.035940 0.000 50.689 50.492 97.441 3.52070 Nunavut

Below is the histogram of 39 selected features where we can have a quick look at the overall distribution of the features and also detect the presence of possible outliers.

In [164]:
import matplotlib
params = {'axes.titlesize':'40',
          'xtick.labelsize':'24',
          'ytick.labelsize':'24'}
matplotlib.rcParams.update(params)
df39.hist(figsize=(80,60))
plt.tight_layout()

From the above histograms we can easily detect some of the outliers; for example:

We expect to have outliers in 2016 population, while about 120 (out of 127) of regions in the list have population less than 5 million, a few (about 7) have population 5-15 million. Note 35 million is the population of Canada. The same is true for population density; while for most regions population density is less than 500 per square kilometer, few have density of 500-1000 (or even 3500-5000) per square kilometer!

In most regions, lone-parent family rate is between 10-20 percent of total census family population while for a few the rate could be as high as 50% of total census family population.

High school graduate rates in most regions are more than 70%, however for a few regions is as low as 20-50%.

We also have some outliers in Average dwelling with average value of dwelling as high as 1200,000 to 1400,000.

In this following we investigate the distribution of some selected features and flag some regions as potential outliers.

In [165]:
data_dictionary.set_index("Feature", inplace=True)
data_dictionary = data_dictionary.rename(index={'GROWTH':'GROWTH_RATE','POP_2016':'POP_2016_in_million','POP_DENSE':'POP_DENSE_in_100','AVE_PERS_INC':'AVE_PERS_INC_in_1000' ,'AVE_PERS_INC_FEMALE':'AVE_PERS_INC_FEMALE_in_1000','AVE_DWELL':'AVE_DWELL_in_100000','MEDIAN_HHLD_INC':'MEDIAN_HHLD_INC_in_1000'})
In [166]:
col = 'POP_DENSE_in_100'
plt.figure(figsize = (8, 40))
g = df39.sort_values(by=[col])
plt.barh(df39["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
In [174]:
col = 'POP_DENSE_in_100'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()

The first 10 regions in the above are potential outliers among regions where Vancouver, Toronto and Montreal particularly stand out with population density 4000-5000 per square kilometer. Among provinces PEI stands out with the population density 25 per square kilometer.

In [170]:
col = 'POP_2016_in_million'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title('2016 Population in million', fontsize=25)
plt.show()

Below we can see the sorted distribution for Lone parent families for all regions and for provinces. Keewatin with 50% stands out among regions and Nunavut with close to 30% among provinces and territories.

In [171]:
col = 'LONE_RATE'
plt.figure(figsize = (8, 40))
g = df39.sort_values(by=[col])
plt.barh(df39["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
In [169]:
col = 'LONE_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()

In distribution chart below, we see 4 regions at the end of the chart with high school graduates rate less than 50% stand out.

In [172]:
col = 'HSG_RATE'
plt.figure(figsize = (8, 40))
g = df39.sort_values(by=[col])
plt.barh(df39["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
In [170]:
col = 'HSG_PSG_BACH_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
 
In [180]:
col = 'PSG_ABOVE_BACH_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
 
In [171]:
col = 'IMM_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()

Nunavut with long-term unemployment rate over 16% is an outlier among provinces/territories.

In [172]:
col = 'UE_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
In [182]:
col = 'EMP_RATE_25_54'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
In [175]:
col = 'LRG_POP_CNTR_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()

PEI is again an outlier for medium population centers among provinces and territories.

In [176]:
col = 'MED_POP_CNTR_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
In [178]:
col = 'SML_POP_CNTRE_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
In [185]:
col = 'RRL_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()

Nunavut and Alberta with growth rate over 11% and New Brunswick with negative growth rate are potential outliers.

In [344]:
col = 'GROWTH_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
In [348]:
col = 'AVE_PERS_INC_in_1000'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title('Average total income (in 1000) in 2015 of population 15 years and over', fontsize=25)
plt.show()

Vancouver, North Shore and Richmond with average value of dwelling about or over a million dollars definitely stand out among the regions. While British Columbia is an outlier among provinces and territories.

In [177]:
col = 'AVE_DWELL_in_100000'
plt.figure(figsize = (18,40))
g = df39.sort_values(by=[col])
plt.barh(df39["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title('Average value of dwelling in 100,000', fontsize=25)
plt.show()
In [350]:
col = 'AVE_DWELL_in_100000'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title('Average value of dwelling in 100,000', fontsize=25)
plt.show()
In [460]:
col = 'LRG_POP_CNTR_RATE'
plt.figure(figsize = (10, 40))
g = df39.sort_values(by=[col])
plt.barh(df39["Geo Name"][g.index], g[col].values) 
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col,'Description'], fontsize=25)
plt.show()

In this section we perform bivariate analysis for a selection of features.

8.1 Pairwise scatter plots with emphasis on growth rate

Let's pick up 7 features with correlation higher than 0.44 with GROWTH RATE and look at their pairwise scatter plot.

In [178]:
#Correlation with output variable
corr_target = abs(corr39["GROWTH_RATE"])
a = corr_target.sort_values(ascending=False)[1:12]
print('the 5-top features with high correlation with growth rate are : \n \n {} '.format(a))
the 5-top features with high correlation with growth rate are : 
 
 GOVT_TRNSFR_INC                0.596805
MEDIAN_HHLD_INC_in_1000        0.562362
POP_65_RATE                    0.552925
IMM_2006_2016_RATE_TOTPOP      0.504974
AVE_PERS_INC_FEMALE_in_1000    0.495273
HSG_PSG_BELOW_BACH_RATE        0.495108
PSG_BACH_RATE                  0.473036
AVE_PERS_INC_in_1000           0.465486
LRG_POP_CNTR_RATE              0.464053
PSG_BELOW_BACH_RATE            0.448703
AVE_DWELL_in_100000            0.446769
Name: GROWTH_RATE, dtype: float64 
In [179]:
dfbp = df39.drop(axis = 0, index = 1)
dfbp = dfbp.drop('Geo Name', axis = 1)
In [180]:
dfGR = dfbp[{'GROWTH_RATE','GOVT_TRNSFR_INC', 'POP_65_RATE','IMM_2006_2016_RATE_TOTPOP','PSG_BACH_RATE', 'AVE_PERS_INC_in_1000', 'LRG_POP_CNTR_RATE', 'AVE_DWELL_in_100000'}]
sns.pairplot(dfGR)
Out[180]:
<seaborn.axisgrid.PairGrid at 0x1395417b8>

The growth rate is almost linearly increasing(decreasing) with average person income (government transfer income). It makes sense, because government transfer income rate is decreasing when average person income increases; the government transfer income is aimed to support proportion of population with lower income. However, government transfer income seems to decrease exponentially with increasing average income not linearly.

Immigration rate from 2006-2016 and post secondary graduates rate with bachelor's degree are also linearly dependent with positive correlation. We can conclude most immigrants to Canada have university degree; this was expected as Canadian universities hire lots of international graduate students. That could also explain why growth rate increases with postgraduates rate with bachelor's degree as well as positive correlation between post graduate rate and large urban population centers; large urban population centers have one or more universities.

Average person income is higher in large urban population centers, also it increases with postgraduates rate which is expected; jobs that require postgraduate degrees are better paid. Average income rate decreases with population over 65 rate, it is expected too as 65 is the retirement age. Average values of dwelling increases with average income as expected; an expensive house is affordable for a higher income person.

Average values of dwelling is higher in large urban population centers and is lower for population over 65; retired people move to smaller/quieter and (probably) less expensive towns.

8.2 Pairwise scatter plots with emphasis on low income rate

We repeat the previous section but this time we pick up features with high correlation with low income rate.

In [475]:
#Correlation with output variable
corr_target = abs(corr39['LW_INC_ECON_FAM_RATE'])
a = corr_target.sort_values(ascending=False)[1:16]
print('the 5-top features with high correlation with growth rate are : \n \n {} '.format(a))
the 5-top features with high correlation with growth rate are : 
 
 LOW_INC_UNATTACHED_RATE      0.786011
IMM_RATE                     0.662422
OWNER_OVER30_RATE            0.643220
IMM_2006_2016_RATE_TOTPOP    0.584356
POP_DENSE_in_100             0.584289
SML_POP_CNTRE_RATE           0.582575
LRG_POP_CNTR_RATE            0.582133
PSG_BELOW_BACH_RATE          0.546312
PSG_BACH_RATE                0.530274
PSG_ABOVE_BACH_RATE          0.522426
AVE_DWELL_in_100000          0.518643
HSG_PSG_ABOVE_BACH_RATE      0.517115
HSG_PSG_BACH_RATE            0.489353
RENTER_OVER30_RATE           0.485369
RRL_RATE                     0.448423
Name: LW_INC_ECON_FAM_RATE, dtype: float64 
In [181]:
dfLIn = dfbp[{'LW_INC_ECON_FAM_RATE','POP_DENSE_in_100','SML_POP_CNTRE_RATE', 'AVE_DWELL_in_100000', 'RRL_RATE', 'LRG_POP_CNTR_RATE'}]
sns.pairplot(dfLIn)
Out[181]:
<seaborn.axisgrid.PairGrid at 0x131560ac8>

Low income rate is decreasing with rural and small population rate and increasing with large urban population rate: Families in high low income rate are less likely to live in the rural areas or small population centers and most likely to live in the large urban population centers. We can further infer that most low income families lives in dwellings with lower average values.

We also see where the population density is high, the low income family rate is high too which again support that most families with lower income live in large urban centers that have higher population density since the population density is high in large urban centers.

Population density in general is 0 (or not available) for most regions in the dataset. This was evident in the barplot of population density in previous section where Vancouver and Toronto were the most dense regions in the list. From the population density vs average values of dwelling it is obvious that Vancouver, which has the highest density, have also the highest rate of average value of dwelling.

In rural areas and small population centers, mostly the average values of dwellings are lower.

The source for the dataset is Statistics Canada therefore reliable. In preparing data dictionary we compared values of all columns for Canada with the data from statistics Canada website and they all match except for the female and lone parent families and male and female rate. We spotted a mistake on the website which was the definition of female and male lone parent families, those columns do not match our dataset as the rates are proportion of lone parent families while we believe our data is proportion of census families. The data for these columns in our dataset for Canada seems to be reasonable.

We addressed some of potential outliers while doing univariate/bivariate analysis. We can also detect outliers using boxplot as follows:

In [182]:
col1 = [col for col in dfbp if max(dfbp[col]) <= 40]
dfbp1 = dfbp[col1]
dfbp2 = dfbp.drop(col1, axis = 1)
dfbp2.shape[1]
Out[182]:
23
In [183]:
dfbp1.boxplot(figsize=(15,10),grid=False, vert=False, fontsize=10)
plt.tight_layout()
In [184]:
dfbp2.boxplot(figsize=(15,10),grid=False, vert=False, fontsize=10)
plt.tight_layout()

We only did minor transformations for some features with large values like population, incomes and dwelling values only to keep them in the same range as values for other columns. It wasn't crucial though.